Bug 1118442

Summary: arequal-checksum of mount after gluster compilation mismatches with subsequent arequal-checksum after heal.
Product: Red Hat Gluster Storage [Red Hat Storage]
Component: replicate
Version: rhgs-3.0
Reporter: spandura
Assignee: Pranith Kumar K <pkarampu>
CC: rhs-bugs, storage-qa-internal
Status: CLOSED NOTABUG
Severity: high
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Type: Bug
Doc Type: Bug Fix
Last Closed: 2014-07-18 05:10:44 UTC

Description spandura 2014-07-10 17:52:00 UTC
Description of problem:
========================
On a 2 x 2 distribute-replicate volume, started cloning glusterfs.git and compiling gluster on the FUSE mount. While the compilation was in progress, brought down one of the bricks; later, with the compilation still in progress, brought the brick back online.

When the compilation completed, calculated the arequal-checksum from the mount.

After the heal completed, calculated the arequal-checksum once again from the mount.

The arequal-checksums calculated after the compilation and after the self-heal should be the same, but they mismatch.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.6.0.24-1.el6_5.x86_64

How reproducible:
====================
Often

Steps to Reproduce:
=====================
1. Create a distribute-replicate volume (2 x 2) and start the volume.

2. Create a FUSE mount. From the FUSE mount, start the git clone and compilation:
( git clone git://github.com/gluster/glusterfs.git glusterfs_1 )
( cd glusterfs_1 ; ./autogen.sh ; ./configure CFLAGS='-g3 -O0 -DDEBUG' ; make )

3. While the glusterfs clone/compilation is in progress (a scripted sketch of this sequence follows these steps):

bring down a brick (kill -KILL <brick_pid>);
sleep 60;
bring the brick back online (service glusterd restart);
sleep 30;
bring down the same brick once again (kill -KILL <brick_pid>);
bring the brick back after some time, while the compilation is still in progress.

4. Once the compilation completes, calculate the arequal-checksum on the mount.

5. Keep monitoring the self-heal; when the self-heal is complete, calculate the arequal-checksum again.

NOTE: in the run where the mismatch was seen, the self-heal had completed well before the compilation finished.
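
A minimal scripted sketch of steps 3 and 4, assuming the fault injection runs on the server hosting the brick and the checksum is taken on the client. The volume name "healtest" and the mount path /gluster-mount are taken from the logs below; the brick path, the PID lookup, and the arequal-checksum invocation are illustrative assumptions rather than the actual test script:

# --- on the brick's server (brick path and PID lookup are assumed) ---
brick_pid() {
    # the PID is the last column of "gluster volume status" for the matching brick
    gluster volume status healtest | awk '/\/bricks\/brick1/ {print $NF}'
}

kill -KILL "$(brick_pid)"        # bring the brick down while the compilation runs
sleep 60
service glusterd restart         # glusterd respawns the killed brick process
sleep 30
kill -KILL "$(brick_pid)"        # bring the same brick down once again
sleep 60                         # "after some time"
service glusterd restart         # bring it back while the compilation is still in progress

# --- on the client, after 'make' completes and again after the heal completes ---
arequal-checksum -p /gluster-mount   # arequal utility shipped with glusterfs (path option assumed)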

Actual results:
======================
Arequal mismatch when "brick1" was brought down and self-heal was triggered. 

:: [   FAIL   ] :: Files /arequal-data/rhsauto048.lab.eng.blr.redhat.com_gluster-mount_arequal_checksum_before.log and /arequal-data/rhsauto048.lab.eng.blr.redhat.com_gluster-mount_arequal_checksum_after.log should not differ 
:: [ 10:26:31 ] ::  arequal checksum of before

Entry counts
Regular files   : 3100
Directories     : 386
Symbolic links  : 67
Other           : 0
Total           : 3553

Metadata checksums
Regular files   : 3e9
Directories     : 3e9
Symbolic links  : 5a815a
Other           : 3e9

Checksums
Regular files   : 93b3b0e997c69380509609c7b5ce63e4
Directories     : 6176d153f506f45
Symbolic links  : 5162560c522b383e
Other           : 0
Total           : 945082374f73a71f
:: [ 10:26:31 ] ::  arequal checksum of after

Entry counts
Regular files   : 3101
Directories     : 386
Symbolic links  : 67
Other           : 0
Total           : 3554

Metadata checksums
Regular files   : 486e85
Directories     : 3e9
Symbolic links  : 5a815a
Other           : 3e9

Checksums
Regular files   : 3d42771862109ffc3d7be933bc768f92
Directories     : 18325967433f4d1c
Symbolic links  : 5162560c522b383e
Other           : 0
Total           : 49699140cf72654c

Arequal mismatch when "brick2" was brought down and self-heal was triggered.

:: [   FAIL   ] :: Files /arequal-data/rhsauto048.lab.eng.blr.redhat.com_gluster-mount_arequal_checksum_before.log and /arequal-data/rhsauto048.lab.eng.blr.redhat.com_gluster-mount_arequal_checksum_after.log should not differ 
:: [ 11:09:48 ] ::  arequal checksum of before

Entry counts
Regular files   : 6202
Directories     : 771
Symbolic links  : 134
Other           : 0
Total           : 7107

Metadata checksums
Regular files   : 3e9
Directories     : 24d74c
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 241bc450f0cce74d32793e6ca5bc9e06
Directories     : 3424b000a124a08
Symbolic links  : 0
Other           : 0
Total           : 1520b13c5f623343
:: [ 11:09:48 ] ::  arequal checksum of after

Entry counts
Regular files   : 6202
Directories     : 771
Symbolic links  : 134
Other           : 0
Total           : 7107

Metadata checksums
Regular files   : 3e9
Directories     : 24d74c
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : f2d8fab12b8841042115d4f80bebb945
Directories     : 3424b000a124a08
Symbolic links  : 0
Other           : 0
Total           : d08f65492a71b249

Check the client logs for test case 237922 in the provided Beaker job link for more failures and grep for the "FAIL" keyword. (client test case log: http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2014/07/6885/688572/1430504/22487893/237922.log)


Expected results:
===================
arequal-checksums should match

Additional info:
====================
Link to the job result which failed the gluster compilation case : https://beaker.engineering.redhat.com/jobs/688572

TCMS Case : 237922

SOS Reports : http://rhsqe-repo.lab.eng.blr.redhat.com/BeakerSosReports/688572/

Comment 2 Pranith Kumar K 2014-07-15 17:26:11 UTC
Shwetha,
     Here are some initial observations I made. I need your input on how the script works.

>:: [ 10:25:11 ] ::  Total Number of files and directories in the volume : 3554

Seems like the expected number of entries in the volume is 3554

>:: [ 10:26:31 ] ::  arequal checksum of before

>Entry counts
>Regular files   : 3100
>Directories     : 386
>Symbolic links  : 67
>Other           : 0
>Total           : 3553

>:: [ 10:26:31 ] ::  arequal checksum of after

>Entry counts
>Regular files   : 3101
>Directories     : 386
>Symbolic links  : 67
>Other           : 0
>Total           : 3554

According to these numbers, the entry count is correct only after the self-heal. The question we need to ask now is: why is the count one less before the self-heal started?

I see one more log:
:: [ 10:26:39 ] ::  Checking if there are additional entries under /gluster-mount  after to before
:: [   FAIL   ] :: Additional entries found under /gluster-mount  after to before 
:: [ 10:26:39 ] ::  Listing all the Additional entries found under /gluster-mount  after to before
+/gluster-mount/rhsauto048.lab.eng.blr.redhat.com_gluster-mount_collect_entries_after.log
:: [ 10:26:47 ] ::  Total Number of files and directories in the volume : 3555

I am not completely sure how the test script is implemented, but it seems the test itself also creates files on the mount point. Although the output above reports additional entries, is it possible that the additional file was created by the test run itself?

I see the following entry in the glusterd logs:
rhsauto035-2014070823081404860924/var/log/glusterfs/etc-glusterfs-glusterd.vol.log:[2014-07-08 10:25:30.488695] E [glusterd-syncop.c:1204:gd_stage_op_phase] 0-management: Staging of operation 'Volume Heal' failed on localhost : Command not supported. Please use "gluster volume heal healtest info" and logs to find the heal information.

Are we executing any "heal info healed" / "heal info heal-failed" commands?

Looking at 'rhsauto035-2014070823081404860924/var/log/glusterfs/glfsheal-healtest.log', successive executions of the 'heal info' command appear to be 16 minutes apart. How does the script monitor the progress of the self-heals?

[2014-07-08 10:09:57.244794] W [client-rpc-fops.c:2758:client3_3_lookup_cbk] 0-healtest-client-0: remote operation failed: No such file or directory. Path: bfc2e706-8d1f-4ac0-ac06-ac418aa15176 (bfc2e706-8d1f-4ac0-ac06-ac418aa15176)
[2014-07-08 10:09:57.245535] I [afr-self-heal-common.c:2147:afr_sh_post_nb_entrylk_missing_entry_sh_cbk] 0-healtest-replicate-0: Non blocking entrylks failed.
[2014-07-08 10:25:25.198457] I [dht-shared.c:334:dht_init_regex] 0-healtest-dht: using regex rsync-hash-regex = ^\.(.+)\.[^.]+$
[2014-07-08 10:25:25.211309] I [glfs-master.c:93:notify] 0-gfapi: New graph 72687361-7574-6f30-3335-2e6c61622e65 (0) coming up
[2014-07-08 10:25:25.211358] I [client.c:2280:notify] 0-healtest-client-0: parent translators are ready, attempting connect on transport

Pranith.

Comment 3 spandura 2014-07-16 14:06:46 UTC
Pranith, 

The number of entries on the mount, i.e. the output:
:: [ 10:26:47 ] ::  Total Number of files and directories in the volume : 3555

is calculated using the command: "number_of_entries=$(find "${MOUNT_POINT_CLIENT}" | wc -l)"

Comment 4 spandura 2014-07-16 14:11:37 UTC
1) We do execute the "healed" and "heal-failed" variants of "heal info" to check that they output "Command Not Supported".

2) "heal info" is executed multiple times:

a. Before bringing the brick online, we execute "heal info" to get the summary, i.e. the number of entries to heal.

b. Once the brick is brought online, we monitor the "indices/xattrop/" directory of each brick in the volume until its entry count becomes zero or the monitor timeout period of 30 minutes is reached (a rough sketch of this loop is shown below).

c. We once again execute "heal info" to check for "Number of entries: 0" from all the bricks, confirming there are no pending self-heals.
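
A rough sketch of the monitoring loop in (b) and the final check in (c), assuming a brick path of /bricks/brick1 and the usual on-brick index location under .glusterfs/indices/xattrop; the paths and the filtering of the base index file are assumptions for illustration, not the test script's actual code:

# run on each brick server: wait until the pending-heal index drains or 30 minutes pass
BRICK=/bricks/brick1                      # assumed brick path
TIMEOUT=$((30 * 60)); ELAPSED=0
while [ "$ELAPSED" -lt "$TIMEOUT" ]; do
    # count index entries, ignoring the base "xattrop-<gfid>" link file
    PENDING=$(ls "$BRICK/.glusterfs/indices/xattrop/" 2>/dev/null | grep -vc '^xattrop-')
    [ "$PENDING" -eq 0 ] && break
    sleep 10; ELAPSED=$((ELAPSED + 10))
done

# step (c): every brick should report zero pending entries
gluster volume heal healtest info | grep "Number of entries"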

Comment 5 Pranith Kumar K 2014-07-17 10:20:13 UTC
Shwetha and I observed that one of the files is being created on the mount in between the runs, which could cause the arequal mismatch. Shwetha made changes to the script and started a run just to make sure the issue is not caused by this. We will resume debugging after that run, based on the results.

Comment 6 spandura 2014-07-18 05:10:44 UTC
The automation script was creating an extra file on the mount after the compilation. Fixed the script to create the file in a directory separate from the mount point itself. The arequal-checksum taken after compilation now matches the subsequent arequal-checksum taken after heal.
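
As a rough illustration of the fix (the paths and file names here are assumptions modelled on the log snippets above, not the script's real code), the entry listing taken after compilation is now written under a log directory such as /arequal-data instead of under /gluster-mount, so it can no longer appear as an extra entry in the arequal counts:

# before (assumed): the listing landed inside the mount and inflated the entry count
# find /gluster-mount > /gluster-mount/${HOSTNAME}_gluster-mount_collect_entries_after.log
# after: write the listing to a directory outside the mount point
find /gluster-mount > /arequal-data/${HOSTNAME}_gluster-mount_collect_entries_after.log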

Moving the bug to "CLOSED" state. 

Link to the job which ran the test case: https://beaker.engineering.redhat.com/jobs/697049