Bug 1726673

Summary: Failures in remove-brick due to [Input/output error] errors
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Sayalee <saraut>
Component: replicate
Assignee: Karthik U S <ksubrahm>
Status: CLOSED ERRATA
QA Contact: Veera Raghava Reddy <vereddy>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.5
CC: ksubrahm, pasik, pprakash, puebele, ravishankar, rhs-bugs, rkothiya, sheggodu, storage-qa-internal, ubansal
Target Milestone: ---
Keywords: ZStream
Target Release: RHGS 3.5.z Batch Update 4
Flags: rkavunga: needinfo-
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-6.0-50
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1728770 (view as bug list)
Environment:
Last Closed: 2021-04-29 07:20:37 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1728770, 1749305, 1749307, 1749352

Description Sayalee 2019-07-03 12:00:58 UTC
Description of problem:
While performing remove-brick to convert a 3x3 volume to a 2x3 volume, the remove-brick rebalance failed with errors such as " E [MSGID: 114031] [client-rpc-fops_v2.c:2540:client4_0_opendir_cbk] 0-vol4-client-8: remote operation failed. Path: /dir1/thread0/level03/level13/level23/level33/level43 (69e97af3-d2d7-450a-881e-0c4ef6ac1355) [Input/output error] "

Version-Release number of selected component (if applicable):
6.0.7

How reproducible:
1/1

Steps to Reproduce:
1. Created a 1x3 volume.
2. Fuse mounted the volume and started I/O on it.
3. Converted it into a 2x3 volume and triggered rebalance.
4. Let the rebalance complete, then converted it into a 3x3 volume and triggered rebalance again.
5. After that, started a remove-brick operation on the volume to convert it back into a 2x3 volume.
6. Checked the remove-brick status (see the example commands after this list).
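For reference, here is a hedged sketch of the gluster CLI sequence these steps imply. The volume name vol4 is taken from the rebalance logs below; the host names, brick paths, and mount point are placeholders, not the actual test setup.

<code>
# 1. Create and start a 1x3 (replica 3) volume (placeholder hosts/bricks)
gluster volume create vol4 replica 3 host1:/bricks/b1 host2:/bricks/b1 host3:/bricks/b1
gluster volume start vol4

# 2. Fuse mount the volume and start I/O on the mount point
mount -t glusterfs host1:/vol4 /mnt/vol4

# 3. Convert to 2x3 by adding a replica set, then rebalance
gluster volume add-brick vol4 host1:/bricks/b2 host2:/bricks/b2 host3:/bricks/b2
gluster volume rebalance vol4 start

# 4. Once rebalance completes, convert to 3x3 and rebalance again
gluster volume add-brick vol4 host1:/bricks/b3 host2:/bricks/b3 host3:/bricks/b3
gluster volume rebalance vol4 start

# 5. Convert back to 2x3 by removing the last replica set
gluster volume remove-brick vol4 host1:/bricks/b3 host2:/bricks/b3 host3:/bricks/b3 start

# 6. Check the remove-brick status
gluster volume remove-brick vol4 host1:/bricks/b3 host2:/bricks/b3 host3:/bricks/b3 status
</code>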

Actual results:
There are failures in the remove-brick rebalance.
Errors from rebalance logs:
E [MSGID: 114031] [client-rpc-fops_v2.c:2540:client4_0_opendir_cbk] 0-vol4-client-2: remote operation failed. Path: /dir1/thread0/level03/level13/level23/level33/level43 (69e97af3-d2d7-450a-881e-0c4ef6ac1355) [Input/output error]

E [MSGID: 114031] [client-rpc-fops_v2.c:2540:client4_0_opendir_cbk] 0-vol4-client-8: remote operation failed. Path: /dir1/thread0/level03/level13/level23/level33/level43 (69e97af3-d2d7-450a-881e-0c4ef6ac1355) [Input/output error]

W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-vol4-client-8: remote operation failed. Path: /dir1/thread0/level03/level13/level23/level33/level43/level53/5d1b1579%%P3TRO7PG35 (558423e2-478e-40e9-9958-31c710e50b89) [Input/output error]

W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-vol4-client-2: remote operation failed. Path: /dir1/thread0/level03/level13/level23/level33/level43 (69e97af3-d2d7-450a-881e-0c4ef6ac1355) [Input/output error]


Expected results:
Remove-brick should complete successfully.

Additional info:
sos-report will be shared.

Comment 7 Mohammed Rafi KC 2019-07-10 16:08:22 UTC
RCA:

As mentioned in comment 6, the operation failed because the lookup could not return the lock count requested through GLUSTERFS_POSIXLK_COUNT. While processing afr_lookup_cbk, if a name heal is required, the heal is performed in afr_lookup_selfheal_wrap, which wipes all of the current lookup data and then issues a fresh lookup so that fresh data can be returned. However, that healing lookup is issued without the original xdata_req, so posix never sees the request and does not populate the lock count.

<code>

int
afr_lookup_selfheal_wrap(void *opaque)
{
    int ret = 0;
    call_frame_t *frame = opaque;
    afr_local_t *local = NULL;
    xlator_t *this = NULL;
    inode_t *inode = NULL;
    uuid_t pargfid = {
        0,
    };

    local = frame->local;
    this = frame->this;
    loc_pargfid(&local->loc, pargfid);

    ret = afr_selfheal_name(frame->this, pargfid, local->loc.name,
                            &local->cont.lookup.gfid_req, local->xattr_req);
    if (ret == -EIO)
        goto unwind;

    afr_local_replies_wipe(local, this->private);

    /* The fresh lookup issued after the name heal passes NULL as the xdata
     * argument, so the GLUSTERFS_POSIXLK_COUNT request carried in
     * local->xattr_req is lost and posix never populates the lock count. */
    inode = afr_selfheal_unlocked_lookup_on(frame, local->loc.parent,
                                            local->loc.name, local->replies,
                                            local->child_up, NULL);
    if (inode)
        inode_unref(inode);

    afr_lookup_metadata_heal_check(frame, this);
    return 0;

unwind:
    AFR_STACK_UNWIND(lookup, frame, -1, EIO, NULL, NULL, NULL, NULL);
    return 0;
}
</code>
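Based on the RCA above, the fix would presumably be to forward the original xattr_req to the fresh post-heal lookup so that the GLUSTERFS_POSIXLK_COUNT request is not dropped. A minimal sketch of that kind of change (the actual upstream patch linked in comment 8 may differ in detail):

<code>
    /* Sketch only: pass the original request xdata instead of NULL so the
     * lock-count key reaches posix; the real patch may handle this differently. */
    inode = afr_selfheal_unlocked_lookup_on(frame, local->loc.parent,
                                            local->loc.name, local->replies,
                                            local->child_up, local->xattr_req);
</code>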

Comment 8 Mohammed Rafi KC 2019-07-10 16:22:16 UTC
upstream patch: https://review.gluster.org/#/c/glusterfs/+/23024

Comment 35 errata-xmlrpc 2021-04-29 07:20:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (glusterfs bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1462