Bug 1635145
Summary: | I/O errors observed on the application side after the creation of a 'linkto' file | ||
---|---|---|---|
Product: | [Community] GlusterFS | Reporter: | Raghavendra G <rgowdapp> |
Component: | distribute | Assignee: | Raghavendra G <rgowdapp> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | mainline | CC: | atumball, bkunal, nbalacha, nravinas, rgowdapp, rhs-bugs, sankarshan, spalai, storage-qa-internal, tdesala, vbellur |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | glusterfs-6.0 | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | 1634649 | Environment: | |
Last Closed: | 2019-03-25 16:31:10 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1634649 | ||
Bug Blocks: |
COMMIT: https://review.gluster.org/21353 committed in master by "Amar Tumballi" <amarts> with a commit message-

cluster/dht: fixes to unlinking invalid linkto file

If unlinking of an invalid linkto file failed in lookup-everywhere codepath, lookup was failed with EIO. The rationale as per the comment was,

<snip>
/* When dht_lookup_everywhere is performed, one cached
 *and one hashed file was found and hashed file does
 *not point to the above mentioned cached node. So it
 *was considered as stale and an unlink was performed.
 *But unlink fails. So may be rebalance is in progress.
 *now ideally we have two data-files. One obtained during
 *lookup_everywhere and one where unlink-failed. So
 *at this point in time we cannot decide which one to
 *choose because there are chances of first cached
 *file is truncated after rebalance and if it is chosen
 *as cached node, application will fail. So return EIO.
 */
</snip>

However, this reasoning is only valid when
* op_errno is EBUSY, indicating rebalance is in progress
* op_errno is ENOTCONN, as we cannot determine what the status of the file on the brick was.

Hence this patch doesn't fail lookup unless unlink fails with either EBUSY or ENOTCONN.

Change-Id: Ife55f3d97fe557f3db05beae0c2d786df31e8e55
Fixes: bz#1635145
Signed-off-by: Raghavendra Gowdappa <rgowdapp>

This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-6.0, please open a new bug report.

glusterfs-6.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2019-March/000120.html
[2] https://www.gluster.org/pipermail/gluster-users/
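The errno rule stated in the commit message can be sketched in a few lines of C. This is only an illustration of the decision, not the code from the patch: the helper names unlink_failure_is_fatal and lookup_errno_after_unlink are hypothetical and do not exist in GlusterFS.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical helper (not GlusterFS code): decides whether a failed
 * unlink of a stale linkto file should make the lookup fail. */
static bool
unlink_failure_is_fatal(int op_ret, int op_errno)
{
    if (op_ret == 0)
        return false; /* unlink succeeded, nothing to fail */

    switch (op_errno) {
    case EBUSY:    /* rebalance may be in progress */
    case ENOTCONN: /* brick unreachable, state of the file is unknown */
        return true;
    default:       /* e.g. ENOENT: the stale linkto file is already gone */
        return false;
    }
}

/* Hypothetical helper: errno the lookup would return after the unlink
 * attempt, with 0 meaning "continue the lookup normally". */
static int
lookup_errno_after_unlink(int op_ret, int op_errno)
{
    return unlink_failure_is_fatal(op_ret, op_errno) ? EIO : 0;
}
```

Under this sketch, the ENOENT failure reported in this bug (op_ret -1, op_errno 2) no longer turns into EIO, while EBUSY and ENOTCONN still do.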
I/O errors observed in nginx logs when linkto files are created.

> Unfortunately, the customer doesn't have any steps to reproduce this issue
> on demand. Enabling parallel-readdir setting triggers this issue with a
> higher frequency.

You are right. Enabling performance.parallel-readdir adds a readdir-ahead translator on top of each subvolume of DHT, and hence the subvolume names change. Since linkto files store the name of the subvolume containing the data file, after enabling performance.parallel-readdir the content of an existing linkto file no longer matches the name of any subvolume of DHT. Hence dht_linkfile_subvol returns NULL, resulting in the error msg and dht_lookup_everywhere. Also, it looks like unlinking the older linkto file failed (maybe there is more than one lookup and a racing lookup cleaned up the older linkto file?), resulting in failure of lookup with EIO.

I am still investigating
1. why unlink failed with ENOENT?
2. whether we need to fail lookup if unlink of linkto fails?

regards,
Raghavendra

--- Additional comment from Raghavendra G on 2018-10-02 04:27:01 EDT ---

From dht_lookup_unlink_of_false_linkto_cbk, the rationale for failing lookup with EIO was that there are two files - one on hashed and the other on cached:

<snip>
+ /*When dht_lookup_everywhere is performed, one cached
+ *and one hashed file was found and hashed file does
+ *not point to the above mentioned cached node. So it
+ *was considered as stale and an unlink was performed.
+ *But unlink fails. So may be rebalance is in progress.
+ *now ideally we have two data-files. One obtained during
+ *lookup_everywhere and one where unlink-failed. So
+ *at this point in time we cannot decide which one to
+ *choose because there are chances of first cached
+ *file is truncated after rebalance and if it is choosen
+ *as cached node, application will fail.
+ *So return EIO.*/
</snip>

However, as can be seen in the error msg above,

[2018-09-28 13:20:10.589411] I [MSGID: 109069] [dht-common.c:2103:dht_lookup_unlink_of_false_linkto_cbk] 2-sputnikimages-dht: lookup_unlink returned with op_ret -> -1 and op-errno -> 2 for <filename removed>

the op_errno is ENOENT. So the file is not present on the hashed subvol, the above reasoning no longer applies to this scenario, and hence failing lookup is wrong.

I'll send a patch to not fail lookup if op_errno happens to be ENOENT. This is the immediate fix for this bug, while I investigate the questions posed in the previous comment.
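The subvolume-name mismatch described in the first comment can be illustrated with a small C sketch. It is a simplified model under stated assumptions, not the real dht_linkfile_subvol: struct subvol, find_subvol_by_name, and the subvolume names used in the usage note are all hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a DHT subvolume. */
struct subvol {
    const char *name;
};

/* Returns the subvolume whose name equals the string stored in the
 * linkto file, or NULL when no current subvolume carries that name. */
static const struct subvol *
find_subvol_by_name(const struct subvol *subvols, size_t count,
                    const char *linkto_value)
{
    for (size_t i = 0; i < count; i++) {
        if (strcmp(subvols[i].name, linkto_value) == 0)
            return &subvols[i];
    }
    return NULL; /* stale name: caller must fall back to a wider lookup */
}
```

For example, if a linkto file was written while the children of DHT were named "vol-client-0" and "vol-client-1" (made-up names), and the children are later renamed to "vol-readdir-ahead-0" and "vol-readdir-ahead-1", a search for the stored value "vol-client-1" returns NULL. That is the situation in which the lookup falls back to dht_lookup_everywhere.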