Bug 961704
Summary: | arequal-checksum after rebalance giving error: Transport endpoint is not connected
---|---
Product: | [Red Hat Storage] Red Hat Gluster Storage
Reporter: | senaik
Component: | glusterfs
Assignee: | Ravishankar N <ravishankar>
Status: | CLOSED ERRATA
QA Contact: | senaik
Severity: | high
Docs Contact: |
Priority: | medium
Version: | 2.1
CC: | amarts, asriram, kaushal, kparthas, rhs-bugs, sdharane, senaik, sgowda, vbellur
Target Milestone: | ---
Keywords: | ZStream
Target Release: | ---
Hardware: | Unspecified
OS: | Unspecified
Whiteboard: |
Fixed In Version: | glusterfs-3.4.0.34rhs
Doc Type: | Bug Fix
Doc Text: | Previously, after a rebalance, the checksum match of the volume failed with a "Transport endpoint is not connected" error. With this update, the inode refresh logic inside glusterFS has been corrected and the checksum match now succeeds.
Story Points: | ---
Clone Of: |
Environment: |
Last Closed: | 2013-11-27 15:24:58 UTC
Type: | Bug
Regression: | ---
Mount Type: | ---
Documentation: | ---
CRM: |
Verified Versions: |
Category: | ---
oVirt Team: | ---
RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | ---
Target Upstream Version: |
Embargoed: |
Attachments: | mount log
Description
senaik
2013-05-10 10:42:29 UTC
sos reports at: http://rhsqe-repo.lab.eng.blr.redhat.com/bugs_necessary_info/961704/

Unable to reproduce the bug on the latest RHS 2.1 downstream repo: arequal-checksum completes successfully after rebalance with one of the bricks down (verified 3 times). Some observations:

1. In the Description, arequal-checksum is run without the -p argument, leading to the "ftw (/mnt/DIS_REP/)" error:

```
[root@tuxvm1 mnt]# arequal-checksum /mnt/fuse_mnt/
ftw (/mnt/fuse_mnt/) returned -1 (No such file or directory), terminating
-------------------
[root@tuxvm1 mnt]# arequal-checksum -p /mnt/fuse_mnt/

Entry counts
Regular files   : 75
Directories     : 1
Symbolic links  : 0
Other           : 0
Total           : 76

Metadata checksums
Regular files   : 486e85
Directories     : 24d74c
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 93ead8ad9f482c42f6a2c8d87e4333a
Directories     : 2215304f55702511
Symbolic links  : 0
Other           : 0
Total           : 441b1480b6094ef
```

2. The SELinux policy seems to be enabled in the dmesg logs, possibly causing connection errors.

Requesting QA to verify whether the scenario is reproducible on the latest downstream.

I am able to reproduce the issue with an NFS mount quite often:

1. Created a distributed volume and NFS-mounted it.
2. Brought down one brick and created a directory and files inside the directory.
3. Brought the brick back, performed a fix-layout, and checked the hash range on the directory.
4. Calculated the checksum.
5. Started rebalance; after it completed, calculated the checksum again.
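The "ftw (...) returned -1 ... terminating" failures above come from the tree walk itself: the first entry that cannot be read (for example ENOTCONN from a disconnected brick, or EIO during rebalance) aborts the whole checksum run. A minimal Python sketch of an arequal-style entry count illustrates this all-or-nothing behaviour (this is an illustration written for this note, not the real arequal-checksum tool):

```python
import os
import stat

def entry_counts(root):
    """Walk a tree and count entries by type, arequal-style.

    Any OSError during the walk (e.g. ENOTCONN, EIO) is re-raised
    immediately, mirroring ftw()'s "returned -1, terminating"
    behaviour: one unreadable entry aborts the whole run.
    """
    counts = {"regular": 0, "directory": 0, "symlink": 0, "other": 0}

    def on_error(err):
        raise err  # abort at once, like the ftw callback returning -1

    for dirpath, dirnames, filenames in os.walk(root, onerror=on_error):
        counts["directory"] += 1  # each visited directory, root included
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            if stat.S_ISREG(st.st_mode):
                counts["regular"] += 1
            elif stat.S_ISLNK(st.st_mode):
                counts["symlink"] += 1
            else:
                counts["other"] += 1
    counts["total"] = sum(counts.values())
    return counts
```

On a healthy mount this produces the same kind of "Entry counts" breakdown shown in the transcripts above; on a mount with a dead brick it raises instead of returning partial numbers.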
Intermittently I faced an I/O error on the mount point:

```
[root@RHEL6 dir2]# /opt/qa/tools/arequal-checksum /mnt/nfs_vol2
ftw (/mnt/nfs_vol2) returned -1 (Input/output error), terminating
[root@RHEL6 dir2]# /opt/qa/tools/arequal-checksum /mnt/nfs_vol2

Entry counts
Regular files   : 100
Directories     : 2
Symbolic links  : 0
Other           : 0
Total           : 102

Metadata checksums
Regular files   : 3e9
Directories     : 3e9
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 9ec955064dea02b4f6511c5cb13263f5
Directories     : 32425864
Symbolic links  : 0
Other           : 0
Total           : 6898495ace9a3925
```

1. The steps mentioned to re-create the bug in the Description and comment #4 are different.
2. I also tried reproducing the steps in comment #4 but was unable to get the error. Could you please upload the sos report for comment #4?

Could you please retry now?

Version: 3.4.0.18rhs-1.el6rhs.x86_64

Faced an I/O error intermittently while checking the arequal checksum on the mount point after rebalance.

Steps:

1) Created a distributed volume and started it.
2) Filled the mount point with some files:

```
for i in {1..500}; do dd if=/dev/urandom of=x"$i" bs=10M count=1; done
```

3) Added 2 bricks and calculated the arequal checksum on the mount point.
4) Started rebalance.
5) Stopped the volume, then started it again.
6) Performed `rebalance start force` and checked the rebalance status.
7) While rebalance was in progress, stopped the volume; we get the following message:

```
gluster v stop Vol3
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
volume stop: Vol3: failed: rebalance session is in progress for the volume 'Vol3'
```

8) gluster v i Vol3:

```
Volume Name: Vol3
Type: Distribute
Volume ID: c22188f6-64ad-4b75-af19-6c9c77209572
Status: Started
Number of Bricks: 6
Transport-type: tcp
Bricks:
Brick1: 10.70.34.85:/rhs/brick1/c1
Brick2: 10.70.34.86:/rhs/brick1/c2
Brick3: 10.70.34.87:/rhs/brick1/c3
Brick4: 10.70.34.85:/rhs/brick1/c4
Brick5: 10.70.34.86:/rhs/brick1/c5
Brick6: 10.70.34.87:/rhs/brick1/c6
```

9) Checked the arequal checksum on the mount point:

```
[root@localhost Vol3]# /opt/qa/tools/arequal-checksum /mnt1/Vol3/
-bash: /opt/qa/tools/arequal-checksum: Input/output error
[root@localhost Vol3]# /opt/qa/tools/arequal-checksum /mnt1/Vol3/
-bash: /opt/qa/tools/arequal-checksum: Input/output error
[root@localhost Vol3]# /opt/qa/tools/arequal-checksum /mnt1/Vol3/

Entry counts
Regular files   : 500
Directories     : 1
Symbolic links  : 0
Other           : 0
Total           : 501

Metadata checksums
Regular files   : 3e9
Directories     : 24d74c
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 36262b317baa4f39578452eea6174722
Directories     : 30312a00
Symbolic links  : 0
Other           : 0
Total           : 61a279dfed8c221b
```

sosreports at: http://rhsqe-repo.lab.eng.blr.redhat.com/bugs_necessary_info/961704/12_Aug_log/

Created attachment 785637 [details]
mount log
Attaching the mount log from 5.9 client
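The reproduction steps rely on how a distribute (DHT) volume spreads files: each brick owns a contiguous slice of the 32-bit hash space stored in the directory layout, and fix-layout/rebalance after add-brick regenerate those slices. A toy model of that mapping (illustrative only; glusterFS uses its own hash function and stores ranges in an xattr, not md5 or this layout format):

```python
import hashlib

def make_layout(bricks, hash_bits=32):
    """Split the hash space into one contiguous range per brick.

    Toy model of a DHT directory layout: the ranges tile the whole
    space with no gaps and no overlaps.
    """
    span = 1 << hash_bits
    step = span // len(bricks)
    layout, start = [], 0
    for i, brick in enumerate(bricks):
        # last brick absorbs the rounding remainder so the space is covered
        end = span - 1 if i == len(bricks) - 1 else start + step - 1
        layout.append((start, end, brick))
        start = end + 1
    return layout

def locate(layout, name):
    """Hash a file name and pick the brick whose range covers it."""
    h = int(hashlib.md5(name.encode()).hexdigest(), 16) & 0xFFFFFFFF
    for start, end, brick in layout:
        if start <= h <= end:
            return brick
    raise AssertionError("hash fell in a gap: broken layout")
```

Adding bricks and running fix-layout corresponds to regenerating these ranges; rebalance then migrates files whose hash now lands in another brick's range, which is the window in which the checksum runs above were failing.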
This bug is identical to BZ 996987 (Transport end point not connected error when one of the replicas is down). Therefore using the same fix for this bug too: https://code.engineering.redhat.com/gerrit/#/c/11666

Requesting QA to verify whether the issue is still being hit with the fix mentioned in comment #12. Setting the needinfo flag instead of moving the bug to ON_QA, as the flags are not targeting rhs-2.1 right now.

Version: 3.4.0.34rhs

Unable to reproduce the issue. Marking the bug as verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1769.html
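The doc text attributes the fix to corrected inode refresh logic. The actual patch is the gerrit change linked in comment #12; as a rough, hypothetical sketch of the general refresh-on-error idea (names and structure invented here, not glusterFS code), the client re-resolves a stale or disconnected inode and retries instead of surfacing ENOTCONN to the application:

```python
import errno

def with_inode_refresh(op, refresh):
    """Run op(); on a stale/disconnected handle, refresh the inode
    (a fresh lookup) and retry once instead of failing outright.

    Hypothetical illustration of the refresh-on-error pattern; the
    real fix lives inside glusterFS's inode handling, not a helper
    like this.
    """
    try:
        return op()
    except OSError as e:
        if e.errno not in (errno.ENOTCONN, errno.ESTALE):
            raise  # unrelated errors still propagate
        refresh()      # re-resolve the inode before retrying
        return op()    # one retry with the refreshed inode
```

Under this pattern a transient "Transport endpoint is not connected" during the checksum walk becomes a retried lookup rather than an aborted run, which matches the behaviour change described in the doc text.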