Bug 1734252

Summary: Heal not completing after geo-rep session is stopped on EC volumes.
Product: [Community] GlusterFS Reporter: Ashish Pandey <aspandey>
Component: disperseAssignee: Ashish Pandey <aspandey>
Severity: high Docs Contact:
Priority: unspecified    
Version: mainlineCC: bugs, kiyer, nchilaka, rhs-bugs, sankarshan, storage-qa-internal
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1733531 Environment:
Last Closed: 2019-07-30 12:58:23 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On: 1733531    
Bug Blocks:    

Comment 1 Ashish Pandey 2019-07-30 05:02:11 UTC
On a fresh setup, If I perform the below mentioned steps:
1.Create a Distributed-Disperse and Disperse volume disabled shd.
2.Started the volume on both master and slave.
3.Created and started a geo-rep setup. 
4.Mounted the volume and started untar from client.
5.Stopped geo-rep session after sometime.
5.Checked heal info on both volumes.(There was nothing to heal.)
6.After sometime enabled shd on both volumes.
7.Checked heal info on both volumes.(There was nothing to heal.)

I am not seeing any issues when shd is disabled at the start and then enabled.the trigger point here seems to be the geo-rep session stop when shd is enabled.

Comment 2 Ashish Pandey 2019-07-30 05:02:23 UTC
Here is the root cause of the issue -

when features.read-only is enabled, ro_fxattrop will check for the following condition - 
    if (is_readonly_or_worm_enabled(frame, this) && !allzero)                                            
        STACK_UNWIND_STRICT(fxattrop, frame, -1, EROFS, NULL, xdata);  

In this is_readonly_or_worm_enabled(frame, this) will return "false" for shd if frame->root->pid < 0, which we set for the frame used in healing as "-6".

However, in this case this frame->root->pid is coming up with value as "0". That's why this condition is failing (0 < 0) and the function returning "true" 
and making this as read-only for shd process also.

Why is it happening?

when shd triggers heal for the file, it is finding that there is nothing to heal so it is calling "ec_data_undo_pending" to remove  dirty flag for data part 
which in turn calling syncop_fxattrop->SYNCOP

We do not pass frame to this SYNCOP and it gets the frame from the task -
        task = synctask_get();                                                 \
        stb->task = task;                                                      \
        if (task)                                                              \
            frame = task->opframe;   

However, while creating task we provided frame as NULL.

ec_launch_heal(ec_t *ec, ec_fop_data_t *fop)
    int ret = 0;

    ret = synctask_new(ec->xl->ctx->env, ec_synctask_heal_wrap, ec_heal_done,
                       NULL, fop);

So synctask_create will create a task with new frame but it will not set frame->root->pid as -6 and it will be "0" only.
This is what we are checking in "is_readonly_or_worm_enabled" and getting read-only as TRUE and the heal (fxattrop) is failing with "read-only file system" error.

When we don't enable feature.read-only, this xlator will not be loaded and this condition will not be checked and hence fxattrop sent by "ec_data_undo_pending"
will succeed and it will remove the dirty [data part] flag.

Comment 3 Ashish Pandey 2019-07-30 05:34:51 UTC


Comment 4 Worker Ant 2019-07-30 05:36:19 UTC
REVIEW: https://review.gluster.org/23129 (cluster/ec: Create heal task with heal process id) posted (#1) for review on master by Ashish Pandey

Comment 5 Worker Ant 2019-07-30 12:58:23 UTC
REVIEW: https://review.gluster.org/23129 (cluster/ec: Create heal task with heal process id) merged (#2) on master by Amar Tumballi