Description of problem:
------------------------
After rebalance on a distribute-replicate volume, the arequal checksum gives the error: Transport endpoint is not connected

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
3.4.0.5rhs-1.el6rhs.x86_64

How reproducible:
-----------------

Steps to Reproduce:
-------------------
1. Create a 2x2 distribute-replicate volume and start it
2. Mount the volume and fill it up with some files; check the arequal checksum on the mount point before rebalance
3. Add bricks and start rebalance
4. While rebalance is in progress, bring down one brick in a replica pair

gluster v rebalance DIS_REP status
       Node   Rebalanced-files       size   scanned   failures        status   run time in secs
  localhost                  4      4.0MB        99          0   in progress               8.00
10.70.34.85                  1      1.0MB       151          0     completed               2.00
10.70.34.86                  0     0Bytes       150          0     completed               1.00
volume rebalance: DIS_REP: success:

gluster v status DIS_REP
Status of volume: DIS_REP
Gluster process                                            Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.34.85:/rhs/brick1/i1                           N/A     N       9869
Brick 10.70.34.86:/rhs/brick1/i2                           49214   Y       8847
Brick 10.70.34.105:/rhs/brick1/i3                          49212   Y       12598
Brick 10.70.34.85:/rhs/brick1/i4                           49229   Y       9878
Brick 10.70.34.105:/rhs/brick1/i5                          49213   Y       12709
Brick 10.70.34.85:/rhs/brick1/i6                           49230   Y       10012
NFS Server on localhost                                    2049    Y       12719
Self-heal Daemon on localhost                              N/A     Y       12726
NFS Server on d834977d-9bfd-4940-8843-aedc9130bd12         2049    Y       8955
Self-heal Daemon on d834977d-9bfd-4940-8843-aedc9130bd12   N/A     Y       8962
NFS Server on 35a8481a-4a77-4149-a883-9db0b68e954f         2049    Y       10022
Self-heal Daemon on 35a8481a-4a77-4149-a883-9db0b68e954f   N/A     Y       10029

Task                 ID                                       Status
----                 --                                       ------
Rebalance            c32a8332-c4f8-4d2b-9d4a-cedb55d24307     1

5. Check rebalance status

gluster v rebalance DIS_REP status
       Node   Rebalanced-files       size   scanned   failures        status   run time in secs
  localhost                 25     25.0MB       176          0     completed               8.00
10.70.34.85                  1      1.0MB       151          0     completed               2.00
10.70.34.86                  0     0Bytes       150          0     completed               1.00
volume rebalance: DIS_REP: success:

6. Check the arequal checksum on the mount point

/opt/qa/tools/arequal-checksum /mnt/DIS_REP/
md5sum: /mnt/DIS_REP/1: Transport endpoint is not connected
/mnt/DIS_REP/1: short read
ftw (/mnt/DIS_REP/) returned -1 (Success), terminating

Actual results:
arequal-checksum on the mount point fails with "Transport endpoint is not connected".

Expected results:
arequal-checksum should complete without errors.

Additional info:

gluster v info DIS_REP

Volume Name: DIS_REP
Type: Distributed-Replicate
Volume ID: 5aaa9d6e-f60e-42ab-a723-9356b6f2999d
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.34.85:/rhs/brick1/i1
Brick2: 10.70.34.86:/rhs/brick1/i2
Brick3: 10.70.34.105:/rhs/brick1/i3
Brick4: 10.70.34.85:/rhs/brick1/i4
Brick5: 10.70.34.105:/rhs/brick1/i5
Brick6: 10.70.34.85:/rhs/brick1/i6
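For readability, a rough command sketch of steps 1-5 above. The volume name, hostnames, brick paths and mount point are taken from the outputs in this description; the exact invocations, the mount server, and which brick was brought down are assumptions, not the commands actually used in the test run:

gluster volume create DIS_REP replica 2 10.70.34.85:/rhs/brick1/i1 10.70.34.86:/rhs/brick1/i2 10.70.34.105:/rhs/brick1/i3 10.70.34.85:/rhs/brick1/i4
gluster volume start DIS_REP
mount -t glusterfs 10.70.34.85:/DIS_REP /mnt/DIS_REP
/opt/qa/tools/arequal-checksum -p /mnt/DIS_REP/                # baseline checksum before rebalance
gluster volume add-brick DIS_REP 10.70.34.105:/rhs/brick1/i5 10.70.34.85:/rhs/brick1/i6
gluster volume rebalance DIS_REP start
# while rebalance is in progress, bring down one brick of a replica pair,
# e.g. kill the glusterfsd process for 10.70.34.85:/rhs/brick1/i1 (pid from 'gluster v status DIS_REP')
kill <brick-pid>
gluster volume rebalance DIS_REP status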
sos reports at : http://rhsqe-repo.lab.eng.blr.redhat.com/bugs_necessary_info/961704/
Unable to reproduce the bug on the latest RHS 2.1 downstream repo. arequal-checksum completes successfully after rebalance with one of the bricks down (verified 3 times).

Some observations:

1. In the Description, arequal-checksum is run without the -p argument, leading to the "ftw (/mnt/DIS_REP/)" error:

##################################################
[root@tuxvm1 mnt]# arequal-checksum /mnt/fuse_mnt/
ftw (/mnt/fuse_mnt/) returned -1 (No such file or directory), terminating
-------------------
[root@tuxvm1 mnt]# arequal-checksum -p /mnt/fuse_mnt/

Entry counts
Regular files   : 75
Directories     : 1
Symbolic links  : 0
Other           : 0
Total           : 76

Metadata checksums
Regular files   : 486e85
Directories     : 24d74c
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 93ead8ad9f482c42f6a2c8d87e4333a
Directories     : 2215304f55702511
Symbolic links  : 0
Other           : 0
Total           : 441b1480b6094ef

[root@tuxvm1 mnt]#
##################################################

2. The SELinux policy appears to be enabled in the dmesg logs, possibly causing the connection errors.

Requesting QA to verify whether the scenario is reproducible on the latest downstream build.
I am able to reproduce the issue with an NFS mount quite often.

1. Created a distributed volume and NFS-mounted it
2. Brought down one brick, then created a directory and files inside the directory
3. Brought the brick back, performed a fix-layout, and checked the hash range on the directory
4. Calculated the checksum
5. Performed rebalance start; after rebalance completed, calculated the checksum again

Intermittently I faced an I/O error on the mount point:

[root@RHEL6 dir2]# /opt/qa/tools/arequal-checksum /mnt/nfs_vol2
ftw (/mnt/nfs_vol2) returned -1 (Input/output error), terminating
[root@RHEL6 dir2]# /opt/qa/tools/arequal-checksum /mnt/nfs_vol2

Entry counts
Regular files   : 100
Directories     : 2
Symbolic links  : 0
Other           : 0
Total           : 102

Metadata checksums
Regular files   : 3e9
Directories     : 3e9
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 9ec955064dea02b4f6511c5cb13263f5
Directories     : 32425864
Symbolic links  : 0
Other           : 0
Total           : 6898495ace9a3925
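A rough command sketch of the steps above. The mount point and the directory name (dir2) come from the transcript; the volume name is assumed from the mount point, and the brick paths and exact invocations are assumptions:

gluster volume create nfs_vol2 10.70.34.85:/rhs/brick1/n1 10.70.34.86:/rhs/brick1/n2
gluster volume start nfs_vol2
mount -t nfs -o vers=3 10.70.34.85:/nfs_vol2 /mnt/nfs_vol2
# bring down one brick (kill its glusterfsd process), then create dir2 and files inside it
gluster volume start nfs_vol2 force                            # brings the downed brick back up
gluster volume rebalance nfs_vol2 fix-layout start
getfattr -n trusted.glusterfs.dht -e hex /rhs/brick1/n1/dir2   # hash range on the directory (run on the brick backend)
/opt/qa/tools/arequal-checksum /mnt/nfs_vol2
gluster volume rebalance nfs_vol2 start
gluster volume rebalance nfs_vol2 status
/opt/qa/tools/arequal-checksum /mnt/nfs_vol2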
1. The steps mentioned to recreate the bug in the Description and in comment #4 are different.
2. I also tried reproducing the steps in comment #4 but was unable to hit the error.

Could you please upload the sosreport for comment #4?
sosreports for comment #4: http://rhsqe-repo.lab.eng.blr.redhat.com/bugs_necessary_info/961704_logs/
Could you please retry now?
Version: 3.4.0.18rhs-1.el6rhs.x86_64
====================================

Faced an I/O error intermittently while checking the arequal checksum on the mount point after rebalance.

Steps:
======
1) Created a distributed volume and started it
2) Filled the mount point with some files:
   for i in {1..500}; do dd if=/dev/urandom of=x"$i" bs=10M count=1; done
3) Added 2 bricks and calculated the arequal checksum on the mount point
4) Started rebalance
5) Stopped the volume, then started it again
6) Performed rebalance start force and checked the rebalance status
7) While rebalance was in progress, tried to stop the volume and got the following message:

gluster v stop Vol3
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
volume stop: Vol3: failed: rebalance session is in progress for the volume 'Vol3'

8) gluster v i Vol3

Volume Name: Vol3
Type: Distribute
Volume ID: c22188f6-64ad-4b75-af19-6c9c77209572
Status: Started
Number of Bricks: 6
Transport-type: tcp
Bricks:
Brick1: 10.70.34.85:/rhs/brick1/c1
Brick2: 10.70.34.86:/rhs/brick1/c2
Brick3: 10.70.34.87:/rhs/brick1/c3
Brick4: 10.70.34.85:/rhs/brick1/c4
Brick5: 10.70.34.86:/rhs/brick1/c5
Brick6: 10.70.34.87:/rhs/brick1/c6

9) Checked the arequal checksum on the mount point:

[root@localhost Vol3]# /opt/qa/tools/arequal-checksum /mnt1/Vol3/
-bash: /opt/qa/tools/arequal-checksum: Input/output error
[root@localhost Vol3]# /opt/qa/tools/arequal-checksum /mnt1/Vol3/
-bash: /opt/qa/tools/arequal-checksum: Input/output error
[root@localhost Vol3]# /opt/qa/tools/arequal-checksum /mnt1/Vol3/

Entry counts
Regular files   : 500
Directories     : 1
Symbolic links  : 0
Other           : 0
Total           : 501

Metadata checksums
Regular files   : 3e9
Directories     : 24d74c
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 36262b317baa4f39578452eea6174722
Directories     : 30312a00
Symbolic links  : 0
Other           : 0
Total           : 61a279dfed8c221b

sosreports at:
==============
http://rhsqe-repo.lab.eng.blr.redhat.com/bugs_necessary_info/961704/12_Aug_log/
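A rough sketch of the commands behind steps 1-6 above. The volume name and brick paths come from the 'gluster v i Vol3' output; which two bricks were added in step 3, the mount server, and the exact invocations are assumptions:

gluster volume create Vol3 10.70.34.85:/rhs/brick1/c1 10.70.34.86:/rhs/brick1/c2 10.70.34.87:/rhs/brick1/c3 10.70.34.85:/rhs/brick1/c4
gluster volume start Vol3
mount -t glusterfs 10.70.34.85:/Vol3 /mnt1/Vol3
cd /mnt1/Vol3 && for i in {1..500}; do dd if=/dev/urandom of=x"$i" bs=10M count=1; done
gluster volume add-brick Vol3 10.70.34.86:/rhs/brick1/c5 10.70.34.87:/rhs/brick1/c6
/opt/qa/tools/arequal-checksum /mnt1/Vol3/
gluster volume rebalance Vol3 start
gluster volume stop Vol3
gluster volume start Vol3
gluster volume rebalance Vol3 start force
gluster volume rebalance Vol3 status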
Created attachment 785637 [details]
mount log

Attaching the mount log from the 5.9 client.
This bug is identical to BZ 996987 (Transport end point not connected error when one of the replicas is down). Therefore using the same fix for this bug too: https://code.engineering.redhat.com/gerrit/#/c/11666
Requesting QA to verify if the issue is still being hit with the fix mentioned in comment #12
Setting the needinfo flag instead of moving the bug to ON_QA (as the flags are also not targeting rhs-2.1 right now).
Version: 3.4.0.34rhs

Unable to reproduce the issue. Marking the bug as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1769.html