Bug 1277924
| Field | Value | Field | Value |
| --- | --- | --- | --- |
| Summary: | Though files are in split-brain able to perform writes to the file | | |
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | RajeshReddy <rmekala> |
| Component: | replicate | Assignee: | hari gowtham <hgowtham> |
| Status: | CLOSED ERRATA | QA Contact: | Vijay Avuthu <vavuthu> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | high | | |
| Version: | rhgs-3.1 | CC: | amukherj, atumball, hgowtham, ksubrahm, mchangir, nbalacha, nchilaka, pkarampu, ravishankar, rhinduja, rhs-bugs, rkavunga, sanandpa, sheggodu |
| Target Milestone: | --- | Keywords: | Reopened, ZStream |
| Target Release: | RHGS 3.4.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | glusterfs-3.12.2-12 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| | 1294051 (view as bug list) | Environment: | |
| Last Closed: | 2018-09-04 06:26:58 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1578823, 1579673, 1579674, 1580344 | | |
| Bug Blocks: | 1294051, 1315140, 1503134 | | |
Description (RajeshReddy, 2015-11-04 11:18:45 UTC)
This bug was accidentally moved from POST to MODIFIED via an error in automation; please see mmccune with any questions.

Update:
===========
Verified with build: glusterfs-3.12.2-6.el7rhgs.x86_64

1) Create a 1 x 2 volume and start it
2) Set cluster.self-heal-daemon to off
3) Create files from the mount point
4) Continuously append to the files from different sessions
5) After a few minutes, kill gluster on Node 1
6) After a few minutes, start glusterd on Node 1 and immediately kill gluster on Node 2
7) After a few minutes, start glusterd on Node 2
8) I/Os will fail and the files will be in split-brain
9) Try to append to a file which is in split-brain; it should fail

# echo "LAST APPENDING" >>f1
-bash: echo: write error: No such file or directory
#

Changing status to Verified.
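For reference, a minimal shell sketch of the steps above. The volume name "repl2", host names, brick paths, and sleep intervals are placeholders, and "kill gluster" is read here as stopping glusterd and the brick process on that node:

```
# On server1: create and start a 1x2 replicate volume, disable the self-heal daemon
gluster volume create repl2 replica 2 server1:/bricks/brick2/b0 server2:/bricks/brick2/b1 force
gluster volume start repl2
gluster volume set repl2 cluster.self-heal-daemon off

# On the client: mount the volume and keep appending from several sessions
mkdir -p /mnt/repl2
mount -t glusterfs server1:/repl2 /mnt/repl2
for f in f1 f2 f3; do
    while true; do echo "append" >> /mnt/repl2/$f; sleep 1; done &
done

# On server1: take its brick offline for a while, then bring it back
systemctl stop glusterd; pkill glusterfsd    # assumption: this is what "kill gluster" means
sleep 300; systemctl start glusterd

# Immediately on server2: take its brick offline, then bring it back later
systemctl stop glusterd; pkill glusterfsd
sleep 300; systemctl start glusterd

# Each brick now has writes the other missed, so the files are in data split-brain;
# with the fix in place a further append must fail.
echo "LAST APPENDING" >> /mnt/repl2/f1       # expected: Input/output error
```

With cluster.self-heal-daemon off, nothing reconciles the two bricks in between, so once the second brick comes back both copies carry writes the other never saw.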
Set up details
==============
git head: 9710b5edaf152142f548e04304f2ea3c1a290fe9, with dht_writev changed to wait in an infinite loop.

2 x 2 hot tier and 2 x 2 cold tier
Two clients with NFS mount points
All performance xlators turned off
tier-mode is "test"

I created one file on the mount point and started a write operation on it from the two mount points. After the lookups and other fops complete, each write waits in dht_writev because of the infinite loop.

Once the file had completely migrated to the cold tier, I allowed one client, say C1, to write (by letting it come out of the infinite loop). Since the file has moved away from the cached subvol, the write fails, which in turn triggers dht_migration_complete_check_task. As part of this task, the afr subvolume in the hot tier marks its afr readables as zero.

Before it updates the tier cached subvolume, I allowed the second write, from client C2, to go through. It hits the afr subvolume of the hot tier and returns EIO because of the zero readables.

I put a breakpoint on "afr_inode_refresh" to see who is calling in to refresh the inode, and also on afr_inode_read_subvol_set to see what the value of the readable array is when we set it.

Back traces are available from the link: https://pastebin.com/Vz3V260Q

(In reply to Mohammed Rafi KC from comment #32)
> Before it updates the tier cached subvolume, I allowed the second write,
> from client C2, to go through. It hits the afr subvolume of the hot tier
> and returns EIO because of the zero readables.

What is the behavior you are expecting in this scenario? I'll try and see if that is semantically correct in AFR or not.

(In reply to Pranith Kumar K from comment #37)
> What is the behavior you are expecting in this scenario? I'll try and see
> if that is semantically correct in AFR or not.

I looked at the inode refresh logic in AFR. If the lookup that happens as part of an inode refresh fails on one of AFR's children, it marks that child non-readable. In addition, if it fails on all of its children, it unwinds the actual read/write FOP with the errno of the lookup failure (and with op_ret=-1). It does not unconditionally return EIO.

> > I put a breakpoint on "afr_inode_refresh" to see who is calling in to
> > refresh the inode, and also on afr_inode_read_subvol_set to see what the
> > value of the readable array is when we set it.
> >
> > Back traces are available from the link: https://pastebin.com/Vz3V260Q

In the bt, I see that the lower xlator to AFR has failed lookup with ENOTCONN and EIO:

0x00007f6bc8726dbc in afr_inode_refresh_subvol_with_lookup_cbk (frame=0x7f6bb4043e1c, cookie=0x0, this=0x7f6bc401cbb0, op_ret=-1, op_errno=107, inode=0x7f6bb400decc,
0x00007f6bc872712d in afr_inode_refresh_subvol_with_fstat_cbk (frame=0x7f6bb001a45c, cookie=0x1, this=0x7f6bc401cbb0, op_ret=-1, op_errno=2, buf=0x7f6bc939b9c0,
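The op_errno values in those frames can be mapped back to symbolic names from the system headers, which is a quick way to read callbacks like these; the header paths below are the usual glibc/kernel locations and may vary by distribution:

```
# 107 -> ENOTCONN, 2 -> ENOENT (the op_errno values seen in the frames above)
grep -w 107 /usr/include/asm-generic/errno.h
grep -w 2   /usr/include/asm-generic/errno-base.h
```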
> In the bt, I see that the lower xlator to AFR has failed lookup with
> ENOTCONN and EIO:
Typo, I meant to say ENOTCONN and ENOENT.
This bug ( https://bugzilla.redhat.com/show_bug.cgi?id=1326248 ) needs to be verified, as the fix was reverted. If the bug still exists we might have to open a new issue and track it (either mark it as a known regression in tier, or check whether this patch https://review.gluster.org/#/c/20029/1 fixes it).

Update:
============
Build used: glusterfs-3.12.2-14.el7rhgs.x86_64
How reproducible: 5/2

Tried the below scenario:

1) Create a 1 x 2 volume and start it
2) Set cluster.self-heal-daemon to off
3) Create files from the mount point (f1-f6)
4) Continuously append to the files from different sessions
5) After a few minutes, kill gluster on Node 1

# gluster vol heal 12 info
Brick 10.70.47.45:/bricks/brick2/b0
Status: Transport endpoint is not connected
Number of entries: -

Brick 10.70.47.144:/bricks/brick2/b1
/f1
/f3
/f4
/f2
/f6
/f5
Status: Connected
Number of entries: 6
#

6) After a few minutes, start glusterd on Node 1 and immediately kill gluster on Node 2

# gluster vol heal 12 info
Brick 10.70.47.45:/bricks/brick2/b0
/f5
/f6
/f1
/f2
/f3
Status: Connected
Number of entries: 5

Brick 10.70.47.144:/bricks/brick2/b1
Status: Transport endpoint is not connected
Number of entries: -
#

I believe it should show 6 entries at this point. Below are the attributes for the file that is missing from heal info (the trusted.afr.* values are decoded in the sketch after this update).

From Node 1:

[root@dhcp47-45 ~]# getfattr -d -m . -e hex /bricks/brick2/b0/f4
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick2/b0/f4
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0xad9ebcfad1f34698a27a7025c73e1fbb
trusted.gfid2path.364f55367c7bd6f4=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f6634
[root@dhcp47-45 ~]#

From Node 2:

[root@dhcp47-144 ~]# getfattr -d -m . -e hex /bricks/brick2/b1/f4
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick2/b1/f4
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.12-client-0=0x000007ec0000000000000000
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0xad9ebcfad1f34698a27a7025c73e1fbb
trusted.gfid2path.364f55367c7bd6f4=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f6634
[root@dhcp47-144 ~]#

7) After a few minutes, start glusterd on Node 2
8) I/Os will fail and the files will be in split-brain
9) Try to append to a file which is in split-brain; it should fail

[root@dhcp35-125 ~]# echo "TEST" >>/mnt/12/f1
-bash: echo: write error: Input/output error
[root@dhcp35-125 ~]#

> ls hangs on the mount point (on file f4)

[root@dhcp35-125 12]# ls
ls: cannot access f2: Input/output error
ls: cannot access f3: Input/output error

> Ran the Health Report Tool (reported 2 errors due to a report tool issue)

> SOS reports: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/vavuthu/bug_1277924_hang/
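To make the getfattr output above easier to read, here is a small decoding sketch. It assumes the standard AFR changelog layout, in which each trusted.afr.&lt;volume&gt;-client-&lt;N&gt; value is three big-endian 32-bit counters: pending data, metadata, and entry operations that the brick holds against child &lt;N&gt;. The helper name is hypothetical:

```
# Hypothetical helper: decode a trusted.afr.* changelog value into its three
# counters (pending data / metadata / entry operations).
decode_afr() {
    v=${1#0x}
    printf 'data=%d metadata=%d entry=%d\n' \
        "0x${v:0:8}" "0x${v:8:8}" "0x${v:16:8}"
}

decode_afr 0x000007ec0000000000000000   # trusted.afr.12-client-0 on brick b1
# -> data=2028 metadata=0 entry=0
```

In the output above, brick b1 holds 2028 pending data operations against client-0 (brick b0), while b0 records nothing against b1 at that point; once both bricks end up accusing each other for the same file, it is in data split-brain and AFR refuses further reads and writes with EIO.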
> ls hangs on the mount point (on file f4)

Karthik, could you take a look at the hang issue on Vijay's set up please? https://code.engineering.redhat.com/gerrit/144275 hasn't made it to a build yet and could be the reason for the hang. If it is not, we need to see if there are some other code paths leading to the hang (unlikely).

I checked the sos-reports and the statedumps of the bricks and clients. It does not seem to be the same issue as https://code.engineering.redhat.com/gerrit/144275, and that should not be the reason, because lookup does not take a lock. When I checked the client statedumps, it looks like it is hung in the write-behind lookup path:

[global.callpool.stack.11]
stack=0x7fa99d4f1370
uid=0
gid=0
pid=10602
unique=1187862
lk-owner=0000000000000000
op=LOOKUP
type=1
cnt=6

[global.callpool.stack.11.frame.1]
frame=0x7fa99d4e1a50
ref_count=0
translator=12-write-behind
complete=0
parent=12-io-cache
wind_from=ioc_lookup
wind_to=FIRST_CHILD(this)->fops->lookup
unwind_to=ioc_lookup_cbk

[global.callpool.stack.11.frame.2]
frame=0x7fa99d4d5420
ref_count=1
translator=12-io-cache
complete=0
parent=12-quick-read
wind_from=qr_lookup
wind_to=(this->children->xlator)->fops->lookup
unwind_to=qr_lookup_cbk

[global.callpool.stack.11.frame.3]
frame=0x7fa99d4d4a30
ref_count=1
translator=12-quick-read
complete=0
parent=12-md-cache
wind_from=mdc_lookup
wind_to=FIRST_CHILD(this)->fops->lookup
unwind_to=mdc_lookup_cbk

[global.callpool.stack.11.frame.4]
frame=0x7fa99d4dfb00
ref_count=1
translator=12-md-cache
complete=0
parent=12
wind_from=io_stats_lookup
wind_to=(this->children->xlator)->fops->lookup
unwind_to=io_stats_lookup_cbk

[global.callpool.stack.11.frame.5]
frame=0x7fa99d4dd400
ref_count=1
translator=12
complete=0
parent=fuse
wind_from=fuse_lookup_resume
wind_to=FIRST_CHILD(this)->fops->lookup
unwind_to=fuse_lookup_cbk

[global.callpool.stack.11.frame.6]
frame=0x7fa99d4f0180
ref_count=1
translator=fuse
complete=0

Vijay is not able to reproduce this issue on the same setup now. @Vijay, since it is not hit always, and this is a separate issue which has nothing to do with the fix (the fix is working as expected), can we move this to Verified and open a new bug for the hang if it is reproducible again?

Update:
==========
> Tried the same scenario (comment #42) several times and was not able to reproduce the hang issue.
> The original issue was that appends to files in split-brain should not succeed.

Scenario:
Verified with build: glusterfs-3.12.2-14.el7rhgs.x86_64

1) Create a 1 x 2 volume and start it
2) Set cluster.self-heal-daemon to off
3) Create files from the mount point
4) Continuously append to the files from different sessions
5) After a few minutes, kill gluster on Node 1
6) After a few minutes, start glusterd on Node 1 and immediately kill gluster on Node 2
7) After a few minutes, start glusterd on Node 2
8) I/Os will fail and the files will be in split-brain
9) Try to append to a file which is in split-brain; it should fail

# echo "LAST APPENDING" >>f1
-bash: f1: Input/output error
#

Changing status to Verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607