1734252 – Heal not completing after geo-rep session is stopped on EC volumes.

Bug 1734252 - Heal not completing after geo-rep session is stopped on EC volumes.

Summary: Heal not completing after geo-rep session is stopped on EC volumes.

Keywords:
Status:	CLOSED NEXTRELEASE
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	disperse
Sub Component:
Version:	mainline
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Ashish Pandey
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:	1733531
Blocks:
TreeView+	depends on / blocked

Reported:	2019-07-30 05:00 UTC by Ashish Pandey
Modified:	2019-07-30 12:58 UTC (History)
CC List:	6 users (show)
Fixed In Version:
Clone Of:	1733531
Environment:
Last Closed:	2019-07-30 12:58:23 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Gluster.org Gerrit	23129	0	None	Merged	cluster/ec: Create heal task with heal process id	2019-07-30 12:58:22 UTC

Comment 1 Ashish Pandey 2019-07-30 05:02:11 UTC

On a fresh setup, If I perform the below mentioned steps:
1.Create a Distributed-Disperse and Disperse volume disabled shd.
2.Started the volume on both master and slave.
3.Created and started a geo-rep setup. 
4.Mounted the volume and started untar from client.
5.Stopped geo-rep session after sometime.
5.Checked heal info on both volumes.(There was nothing to heal.)
6.After sometime enabled shd on both volumes.
7.Checked heal info on both volumes.(There was nothing to heal.)

I am not seeing any issues when shd is disabled at the start and then enabled.the trigger point here seems to be the geo-rep session stop when shd is enabled.

Comment 2 Ashish Pandey 2019-07-30 05:02:23 UTC

Here is the root cause of the issue -

when features.read-only is enabled, ro_fxattrop will check for the following condition - 
    if (is_readonly_or_worm_enabled(frame, this) && !allzero)                                            
        STACK_UNWIND_STRICT(fxattrop, frame, -1, EROFS, NULL, xdata);  

In this is_readonly_or_worm_enabled(frame, this) will return "false" for shd if frame->root->pid < 0, which we set for the frame used in healing as "-6".

However, in this case this frame->root->pid is coming up with value as "0". That's why this condition is failing (0 < 0) and the function returning "true" 
and making this as read-only for shd process also.

Why is it happening?

when shd triggers heal for the file, it is finding that there is nothing to heal so it is calling "ec_data_undo_pending" to remove  dirty flag for data part 
which in turn calling syncop_fxattrop->SYNCOP

We do not pass frame to this SYNCOP and it gets the frame from the task -
                                                                               \
        task = synctask_get();                                                 \
        stb->task = task;                                                      \
        if (task)                                                              \
            frame = task->opframe;   

However, while creating task we provided frame as NULL.

ec_launch_heal(ec_t *ec, ec_fop_data_t *fop)
{
    int ret = 0;

    ret = synctask_new(ec->xl->ctx->env, ec_synctask_heal_wrap, ec_heal_done,
                       NULL, fop);

----------------------------
So synctask_create will create a task with new frame but it will not set frame->root->pid as -6 and it will be "0" only.
This is what we are checking in "is_readonly_or_worm_enabled" and getting read-only as TRUE and the heal (fxattrop) is failing with "read-only file system" error.

When we don't enable feature.read-only, this xlator will not be loaded and this condition will not be checked and hence fxattrop sent by "ec_data_undo_pending"
will succeed and it will remove the dirty [data part] flag.

Comment 3 Ashish Pandey 2019-07-30 05:34:51 UTC


https://review.gluster.org/#/c/glusterfs/+/23129/

Comment 4 Worker Ant 2019-07-30 05:36:19 UTC

REVIEW: https://review.gluster.org/23129 (cluster/ec: Create heal task with heal process id) posted (#1) for review on master by Ashish Pandey

Comment 5 Worker Ant 2019-07-30 12:58:23 UTC

REVIEW: https://review.gluster.org/23129 (cluster/ec: Create heal task with heal process id) merged (#2) on master by Amar Tumballi

Note You need to log in before you can comment on or make changes to this bug.