Bug 961704

Summary: arequal checksum after rebalance gives error: Transport endpoint is not connected
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: senaik
Component: glusterfs
Assignee: Ravishankar N <ravishankar>
Status: CLOSED ERRATA
QA Contact: senaik
Severity: high
Priority: medium
Version: 2.1
CC: amarts, asriram, kaushal, kparthas, rhs-bugs, sdharane, senaik, sgowda, vbellur
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Fixed In Version: glusterfs-3.4.0.34rhs
Doc Type: Bug Fix
Doc Text:
Previously, after a rebalance, the checksum match on the volume failed with a "Transport endpoint is not connected" error. With this update, the inode refresh logic inside glusterFS has been corrected, and the checksum match succeeds.
Story Points: ---
Last Closed: 2013-11-27 15:24:58 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---
Attachments: mount log

Description senaik 2013-05-10 10:42:29 UTC
Description of problem:
--------------------------- 
After rebalance on a distributed-replicate volume, the arequal checksum check fails with the error: Transport endpoint is not connected


Version-Release number of selected component (if applicable):
------------------------------------------------------------- 
3.4.0.5rhs-1.el6rhs.x86_64

How reproducible:
----------------- 

Steps to Reproduce:
-------------------- 
1. Create a 2x2 distributed-replicate volume and start it (a command-level sketch of these steps follows the output of step 6).

2. Mount the volume and fill it with some files.
   Record the arequal checksum on the mount point before rebalance.

3. Add a pair of bricks and start rebalance.

4. While rebalance is in progress, bring down one brick in a replica pair:

gluster v rebalance DIS_REP status
Node      Rebalanced-files  size  scanned  failures  status  run time in secs
localhost     4            4.0MB    99       0       in progress   8.00
10.70.34.85   1            1.0MB    151      0       completed     2.00
10.70.34.86   0            0Bytes   150      0       completed     1.00
volume rebalance: DIS_REP: success:

gluster v status DIS_REP
Status of volume: DIS_REP
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick 10.70.34.85:/rhs/brick1/i1			N/A	N	9869
Brick 10.70.34.86:/rhs/brick1/i2			49214	Y	8847
Brick 10.70.34.105:/rhs/brick1/i3			49212	Y	12598
Brick 10.70.34.85:/rhs/brick1/i4			49229	Y	9878
Brick 10.70.34.105:/rhs/brick1/i5			49213	Y	12709
Brick 10.70.34.85:/rhs/brick1/i6			49230	Y	10012
NFS Server on localhost					2049	Y	12719
Self-heal Daemon on localhost				N/A	Y	12726
NFS Server on d834977d-9bfd-4940-8843-aedc9130bd12	2049	Y	8955
Self-heal Daemon on d834977d-9bfd-4940-8843-aedc9130bd12	N/A	Y	8962
NFS Server on 35a8481a-4a77-4149-a883-9db0b68e954f	2049	Y	10022
Self-heal Daemon on 35a8481a-4a77-4149-a883-9db0b68e954f	N/A	Y	10029
 
           Task                                      ID         Status
           ----                                      --         ------
      Rebalance    c32a8332-c4f8-4d2b-9d4a-cedb55d24307              1

5. Check Rebalance status 

gluster v rebalance DIS_REP status
Node      Rebalanced-files  size  scanned  failures  status  run time in secs
localhost     25           25.0MB   176      0       completed     8.00
10.70.34.85   1            1.0MB    151      0       completed     2.00
10.70.34.86   0            0Bytes   150      0       completed     1.00
volume rebalance: DIS_REP: success:


6. Run the arequal checksum on the mount point:

/opt/qa/tools/arequal-checksum /mnt/DIS_REP/
md5sum: /mnt/DIS_REP/1: Transport endpoint is not connected
/mnt/DIS_REP/1: short read
ftw (/mnt/DIS_REP/) returned -1 (Success), terminating
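
For reference, a minimal shell sketch of the reproduction steps above, assuming the brick layout shown in the gluster v info output below. The dd loop used to populate the volume, the mount server, and killing the brick process by PID are assumptions about how the steps were carried out; arequal-checksum is shown here with -p, per the note in comment 3.

# 1. Create and start a 2x2 distributed-replicate volume
gluster volume create DIS_REP replica 2 \
    10.70.34.85:/rhs/brick1/i1 10.70.34.86:/rhs/brick1/i2 \
    10.70.34.105:/rhs/brick1/i3 10.70.34.85:/rhs/brick1/i4
gluster volume start DIS_REP

# 2. Mount the volume, create some files, and record the baseline checksum
mount -t glusterfs 10.70.34.85:/DIS_REP /mnt/DIS_REP
for i in {1..100}; do dd if=/dev/urandom of=/mnt/DIS_REP/f"$i" bs=1M count=1; done
/opt/qa/tools/arequal-checksum -p /mnt/DIS_REP/

# 3. Add a replica pair (assumed to be i5/i6) and start rebalance
gluster volume add-brick DIS_REP 10.70.34.105:/rhs/brick1/i5 10.70.34.85:/rhs/brick1/i6
gluster volume rebalance DIS_REP start

# 4. While rebalance is in progress, bring down one brick of a replica pair
#    (the brick PID can be read from 'gluster volume status DIS_REP')
kill -9 <brick-pid>

# 5-6. Wait for rebalance to complete, then re-run the checksum
gluster volume rebalance DIS_REP status
/opt/qa/tools/arequal-checksum -p /mnt/DIS_REP/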
  
Actual results:
arequal-checksum fails on the mount point with "Transport endpoint is not connected".

Expected results:
arequal-checksum should complete successfully after rebalance.

Additional info:
gluster v info DIS_REP
 
Volume Name: DIS_REP
Type: Distributed-Replicate
Volume ID: 5aaa9d6e-f60e-42ab-a723-9356b6f2999d
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.34.85:/rhs/brick1/i1
Brick2: 10.70.34.86:/rhs/brick1/i2
Brick3: 10.70.34.105:/rhs/brick1/i3
Brick4: 10.70.34.85:/rhs/brick1/i4
Brick5: 10.70.34.105:/rhs/brick1/i5
Brick6: 10.70.34.85:/rhs/brick1/i6

Comment 3 Ravishankar N 2013-05-17 08:46:02 UTC
Unable to reproduce the bug on the latest RHS 2.1 downstream repo.
arequal-checksum completes successfully after rebalance with one of the bricks down (verified 3 times).

Some observations:
1. In the description, arequal-checksum is run without the -p argument, leading to the "ftw (/mnt/DIS_REP/)" error:
##################################################
[root@tuxvm1 mnt]# arequal-checksum  /mnt/fuse_mnt/
ftw (/mnt/fuse_mnt/) returned -1 (No such file or directory), terminating

-------------------
[root@tuxvm1 mnt]# arequal-checksum -p /mnt/fuse_mnt/

Entry counts
Regular files   : 75
Directories     : 1
Symbolic links  : 0
Other           : 0
Total           : 76

Metadata checksums
Regular files   : 486e85
Directories     : 24d74c
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 93ead8ad9f482c42f6a2c8d87e4333a
Directories     : 2215304f55702511
Symbolic links  : 0
Other           : 0
Total           : 441b1480b6094ef
[root@tuxvm1 mnt]# 
##################################################

2. The dmesg logs indicate that the SELinux policy is enabled, which may be causing the connection errors.

Requesting QA to verify whether the scenario is reproducible on the latest downstream build.

Comment 4 senaik 2013-05-20 10:42:41 UTC
I am able to reproduce the issue quite often with an NFS mount.

1. Create a distributed volume and NFS-mount it.
2. Bring down one brick, then create a directory and some files inside it.
3. Bring the brick back up, perform a fix-layout, and check the hash range on the directory.
4. Calculate the checksum.
5. Start rebalance; after it completes, calculate the checksum again.
Intermittently, the mount point returns an I/O error (a command sketch of these steps follows the output below):

[root@RHEL6 dir2]# /opt/qa/tools/arequal-checksum /mnt/nfs_vol2 
ftw (/mnt/nfs_vol2) returned -1 (Input/output error), terminating
[root@RHEL6 dir2]# /opt/qa/tools/arequal-checksum /mnt/nfs_vol2 

Entry counts
Regular files   : 100
Directories     : 2
Symbolic links  : 0
Other           : 0
Total           : 102

Metadata checksums
Regular files   : 3e9
Directories     : 3e9
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 9ec955064dea02b4f6511c5cb13263f5
Directories     : 32425864
Symbolic links  : 0
Other           : 0
Total           : 6898495ace9a3925
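
A rough shell sketch of the sequence in this comment, for reference. The volume name nfs_vol2 and the dir2 directory are taken from the output above; the brick paths (n1/n2), the dd loop, killing the brick process, and restarting it with "start force" are assumptions about how the steps were carried out:

# 1. Create a plain distributed volume and mount it over NFS (gluster NFS is v3)
gluster volume create nfs_vol2 10.70.34.85:/rhs/brick1/n1 10.70.34.86:/rhs/brick1/n2
gluster volume start nfs_vol2
mount -t nfs -o vers=3 10.70.34.85:/nfs_vol2 /mnt/nfs_vol2

# 2. Bring down one brick (kill its glusterfsd), then create a directory and files
kill -9 <brick-pid>
mkdir /mnt/nfs_vol2/dir2
for i in {1..100}; do dd if=/dev/urandom of=/mnt/nfs_vol2/dir2/f"$i" bs=1M count=1; done

# 3. Bring the brick back, run a fix-layout, and inspect the hash range on the directory
gluster volume start nfs_vol2 force
gluster volume rebalance nfs_vol2 fix-layout start
getfattr -n trusted.glusterfs.dht -e hex /rhs/brick1/n1/dir2

# 4-5. Checksum before and after a full rebalance
/opt/qa/tools/arequal-checksum /mnt/nfs_vol2
gluster volume rebalance nfs_vol2 start
gluster volume rebalance nfs_vol2 status
/opt/qa/tools/arequal-checksum /mnt/nfs_vol2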

Comment 5 Ravishankar N 2013-05-21 04:56:13 UTC
1. The steps mentioned to recreate the bug in the description and in comment #4 are different.
2. I also tried reproducing the steps in comment #4 but was unable to hit the error.

Could you please upload the sosreport for comment #4?

Comment 8 senaik 2013-05-21 11:38:02 UTC
Could you please retry now?

Comment 9 senaik 2013-08-12 07:39:17 UTC
Version: 3.4.0.18rhs-1.el6rhs.x86_64
========

Faced an intermittent I/O error while running the arequal checksum on the mount point after rebalance.

Steps : 
=========
1) Created a Distributed Volume and started it 

2) Fill the mount point with some files 
for i in {1..500}; do dd if=/dev/urandom of=x"$i" bs=10M count=1; done

3) Add 2 bricks and calculate the arequal checksum on the mount point

4) Start rebalance 

5) Stop the volume , then start it again

6) Perform rebalance start force and check the rebalance status (a command sketch of these steps follows the sosreports link at the end of this comment)

7) While rebalance is in progress, attempt to stop the volume; the following message is returned:
gluster v stop Vol3
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
volume stop: Vol3: failed: rebalance session is in progress for the volume 'Vol3'

8)  gluster v i Vol3
 
Volume Name: Vol3
Type: Distribute
Volume ID: c22188f6-64ad-4b75-af19-6c9c77209572
Status: Started
Number of Bricks: 6
Transport-type: tcp
Bricks:
Brick1: 10.70.34.85:/rhs/brick1/c1
Brick2: 10.70.34.86:/rhs/brick1/c2
Brick3: 10.70.34.87:/rhs/brick1/c3
Brick4: 10.70.34.85:/rhs/brick1/c4
Brick5: 10.70.34.86:/rhs/brick1/c5
Brick6: 10.70.34.87:/rhs/brick1/c6

9) Check the arequal checksum on the mount point:

[root@localhost Vol3]# /opt/qa/tools/arequal-checksum /mnt1/Vol3/
-bash: /opt/qa/tools/arequal-checksum: Input/output error
[root@localhost Vol3]# /opt/qa/tools/arequal-checksum /mnt1/Vol3/
-bash: /opt/qa/tools/arequal-checksum: Input/output error
[root@localhost Vol3]# /opt/qa/tools/arequal-checksum /mnt1/Vol3/

Entry counts
Regular files   : 500
Directories     : 1
Symbolic links  : 0
Other           : 0
Total           : 501

Metadata checksums
Regular files   : 3e9
Directories     : 24d74c
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 36262b317baa4f39578452eea6174722
Directories     : 30312a00
Symbolic links  : 0
Other           : 0
Total           : 61a279dfed8c221b

sosreports at:
==============
http://rhsqe-repo.lab.eng.blr.redhat.com/bugs_necessary_info/961704/12_Aug_log/
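
For completeness, a hedged sketch of the command sequence from the steps above. The brick paths come from the gluster v i output and the mount point from the step 9 output; which two bricks were added (c5/c6 is assumed here) is not stated explicitly:

# 3. Add two bricks and take a baseline checksum on the mount point
gluster volume add-brick Vol3 10.70.34.86:/rhs/brick1/c5 10.70.34.87:/rhs/brick1/c6
/opt/qa/tools/arequal-checksum /mnt1/Vol3/

# 4-5. Start rebalance, then stop and restart the volume
gluster volume rebalance Vol3 start
gluster volume stop Vol3
gluster volume start Vol3

# 6. Restart rebalance with force and watch its status
gluster volume rebalance Vol3 start force
gluster volume rebalance Vol3 status

# 7. Attempting to stop the volume while rebalance is running is rejected
gluster volume stop Vol3

# 9. Checksum on the mount point after rebalance completes
/opt/qa/tools/arequal-checksum /mnt1/Vol3/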

Comment 10 senaik 2013-08-12 10:38:14 UTC
Created attachment 785637 [details]
mount log

Attaching the mount log from the 5.9 client.

Comment 12 Ravishankar N 2013-08-26 09:01:50 UTC
This bug is identical to BZ 996987 ("Transport endpoint not connected" error when one of the replicas is down), so the same fix is being used for this bug too:
https://code.engineering.redhat.com/gerrit/#/c/11666

Comment 13 Ravishankar N 2013-09-06 06:58:02 UTC
Requesting QA to verify whether the issue is still hit with the fix mentioned in comment #12.

Comment 14 Amar Tumballi 2013-09-06 07:05:02 UTC
Setting the needinfo flag instead of moving the bug to ON_QA (as the flags are not targeting rhs-2.1 right now).

Comment 15 senaik 2013-10-16 05:14:33 UTC
Version: 3.4.0.34rhs

Unable to reproduce the issue. Marking the bug as verified.

Comment 17 errata-xmlrpc 2013-11-27 15:24:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1769.html