Bug 1740968

Summary: glustershd cannot decide healed_sinks and skips repair, so some entries linger in volume heal info
Product: [Community] GlusterFS
Reporter: zhou lin <zz.sh.cynthia>
Component: replicate
Assignee: Karthik U S <ksubrahm>
Status: CLOSED NEXTRELEASE
QA Contact: Nag Pavan Chilakam <nchilaka>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.1
CC: bugs, rhs-bugs, shujun.huang, storage-qa-internal
Target Milestone: ---
Keywords: Reopened
Target Release: ---
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1749322 (view as bug list)
Environment:
Last Closed: 2019-10-11 08:34:04 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1749322
Bug Blocks:
Attachments:
mn-0 node glustershd log + services brick glusterfsd log

Description zhou lin 2019-08-14 02:48:18 UTC
Description of problem:
[root@mn-0:/home/robot]
# gluster v heal services info
Brick mn-0.local:/mnt/bricks/services/brick
/db/upgrade 
Status: Connected
Number of entries: 1

Brick mn-1.local:/mnt/bricks/services/brick
/db/upgrade 
Status: Connected
Number of entries: 1

Brick dbm-0.local:/mnt/bricks/services/brick
Status: Connected
Number of entries: 0

Those entries keep showing up in the gluster v heal info command. From the glustershd log, each time glustershd processes this entry nothing is actually done; from the gdb info, shd cannot decide the healed_sinks, so nothing is done in each round of repair.


[root@mn-0:/home/robot]
# gluster v heal services info
Brick mn-0.local:/mnt/bricks/services/brick
/db/upgrade 
Status: Connected
Number of entries: 1

Brick mn-1.local:/mnt/bricks/services/brick
/db/upgrade 
Status: Connected
Number of entries: 1

Brick dbm-0.local:/mnt/bricks/services/brick
Status: Connected
Number of entries: 0


[Env info]
Three bricks: mn-0, mn-1, dbm-0
[root@mn-1:/mnt/bricks/services/brick/db/upgrade]
# gluster v info services

Volume Name: services
Type: Replicate
Volume ID: 062748ce-0876-46f6-9936-d9ff3a2b110a
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: mn-0.local:/mnt/bricks/services/brick
Brick2: mn-1.local:/mnt/bricks/services/brick
Brick3: dbm-0.local:/mnt/bricks/services/brick
Options Reconfigured:
cluster.heal-timeout: 60
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
cluster.server-quorum-type: none
cluster.quorum-type: auto
cluster.quorum-reads: true
cluster.consistent-metadata: on
server.allow-insecure: on
network.ping-timeout: 42
cluster.favorite-child-policy: mtime
client.ssl: on
server.ssl: on
ssl.private-key: /var/opt/nokia/certs/glusterfs/glusterfs.key
ssl.own-cert: /var/opt/nokia/certs/glusterfs/glusterfs.pem
ssl.ca-list: /var/opt/nokia/certs/glusterfs/glusterfs.ca
cluster.server-quorum-ratio: 51%

[debug info]
[root@mn-0:/mnt/bricks/services/brick/db]
# getfattr -m . -d -e hex upgrade/
# file: upgrade/
system.posix_acl_access=0x0200000001000700ffffffff04000500ffffffff08000700d302000008000700d402000010000700ffffffff20000500ffffffff
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.services-client-1=0x000000000000000000000015
trusted.afr.services-client-2=0x000000000000000000000000
trusted.gfid=0xf9ebed9856fb4e26987c3a890ed5203c
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
[root@mn-1:/mnt/bricks/services/brick/db/upgrade]
# getfattr -m . -d -e hex .
# file: .
system.posix_acl_access=0x0200000001000700ffffffff04000500ffffffff08000700d302000008000700d402000010000700ffffffff20000500ffffffff
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.services-client-0=0x000000000000000000000003
trusted.afr.services-client-2=0x000000000000000000000000
trusted.gfid=0xf9ebed9856fb4e26987c3a890ed5203c
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
[root@dbm-0:/mnt/bricks/services/brick/db/upgrade]
# getfattr -m . -d -e hex .
# file: .
system.posix_acl_access=0x0200000001000700ffffffff04000500ffffffff08000700d302000008000700d402000010000700ffffffff20000500ffffffff
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.services-client-0=0x000000000000000000000000
trusted.afr.services-client-1=0x000000000000000000000000
trusted.gfid=0xf9ebed9856fb4e26987c3a890ed5203c
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
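
Reading these xattrs with the usual AFR changelog layout (assumption: each trusted.afr.<vol>-client-N value holds three big-endian 32-bit pending counters, for data, metadata and entry operations, in that order):

    mn-0  : trusted.afr.services-client-1 entry counter = 0x15 = 21  -> blames mn-1
    mn-1  : trusted.afr.services-client-0 entry counter = 0x03 = 3   -> blames mn-0
    dbm-0 : both client counters all zero                            -> blames nobody

so two bricks accuse each other and the third accuses neither. A minimal decode helper for these values (hypothetical sketch assuming that layout, not glusterfs code):

    #include <stdint.h>
    #include <string.h>
    #include <arpa/inet.h>

    struct afr_pending { uint32_t data; uint32_t metadata; uint32_t entry; };

    /* Decode one 12-byte trusted.afr.<vol>-client-N value (big-endian counters). */
    static struct afr_pending afr_decode_pending(const unsigned char buf[12])
    {
            uint32_t v[3];
            struct afr_pending p;
            memcpy(v, buf, sizeof(v));      /* copy to avoid unaligned reads */
            p.data = ntohl(v[0]);
            p.metadata = ntohl(v[1]);
            p.entry = ntohl(v[2]);
            return p;
    }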

gdb attached to the mn-0 glustershd process:
Thread 14 "glustershdheal" hit Breakpoint 10, __afr_selfheal_entry_prepare (frame=frame@entry=0x7f54840321e0, this=this@entry=0x7f548c016980, 
    inode=<optimized out>, locked_on=locked_on@entry=0x7f545effc780 "\001\001\001dT\177", sources=sources@entry=0x7f545effc7c0 "", 
    sinks=sinks@entry=0x7f545effc7b0 "", healed_sinks=<optimized out>, replies=<optimized out>, source_p=<optimized out>, pflag=<optimized out>)
    at afr-self-heal-entry.c:546
546 in afr-self-heal-entry.c
(gdb) print heald_sinks[0]
No symbol "heald_sinks" in current context.
(gdb) print healed_sinks[0]
value has been optimized out
(gdb) print source
$12 = 2
(gdb) print sinks[0]
$13 = 0 '\000'
(gdb) print sinks[1]
$14 = 0 '\000'
(gdb) print sinks[2]
$15 = 0 '\000'
(gdb) print locked_on[0]
$16 = 1 '\001'
(gdb) print locked_on[1]
$17 = 1 '\001'
(gdb) print locked_on[2]
$18 = 1 '\001'

According to the code in __afr_selfheal_entry, in each round of heal healed_sinks is all zero, so the check “if (AFR_COUNT(healed_sinks, priv->child_count) == 0)” goes to unlock and that round of heal is skipped; as a result /db/upgrade keeps showing up in the “volume heal info” command. It seems the current gluster shd code does not handle this kind of situation, and having the entry show up forever is not ideal.
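
For reference, the skip corresponds roughly to this check in __afr_selfheal_entry() (paraphrased from the 3.12 sources; surrounding code omitted):

    /* after __afr_selfheal_entry_prepare() has filled sources/sinks/healed_sinks */
    if (AFR_COUNT(healed_sinks, priv->child_count) == 0) {
            ret = -ENOTCONN;        /* nothing marked as a sink -> skip this round */
            goto unlock;
    }
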
Any idea how to improve this?

Version-Release number of selected component (if applicable):
glusterfs 3.12.15

How reproducible:

Reboot all 3 storage nodes at the same time.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Karthik U S 2019-08-14 04:57:07 UTC
According to the gdb output, healed_sinks is not set to zero; it has been optimized out, which is why you are not able to see the values set in it.
Please attach the gluster logs from all the nodes to debug this further.

Since the gluster version is 3.12.15, please file a new bug under https://bugzilla.redhat.com/enter_bug.cgi?product=GlusterFS with the logs attached. I'm closing this bug as this is a community project.

Comment 3 zhou lin 2019-08-15 02:26:50 UTC
Each time we hit this issue the changelogs look very alike: two nodes mutually accuse each other while the third node does not accuse anyone. At the same time, on the node that shows this entry in "gluster volume heal info", the glustershd log looks like the following. It seems each round of the shd scan does not complete; it drops out because it cannot find any healed_sinks, and according to afr-self-heal-entry.c, when the count of healed_sinks is zero it just skips the subsequent repair steps.

[root@mn-0:/mnt/bricks/ccs/brick/.glusterfs/indices/xattrop]
# gluster v heal ccs info
Brick mn-0.local:/mnt/bricks/ccs/brick
/ 
Status: Connected
Number of entries: 1

Brick mn-1.local:/mnt/bricks/ccs/brick
Status: Connected
Number of entries: 0

Brick dbm-0.local:/mnt/bricks/ccs/brick
/ 
Status: Connected
Number of entries: 1

[root@mn-0:/root]
# getfattr -m . -d -e hex /mnt/bricks/ccs/brick/
getfattr: Removing leading '/' from absolute path names
# file: mnt/bricks/ccs/brick/
trusted.afr.ccs-client-1=0x000000000000000000000000
trusted.afr.ccs-client-2=0x000000000000000000000002
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.volume-id=0x15301766900e4e9fb0b87c6b3f6e90f0

[root@mn-1:/home/robot]
# getfattr -m . -d -e hex /mnt/bricks/ccs/brick
getfattr: Removing leading '/' from absolute path names
# file: mnt/bricks/ccs/brick
trusted.afr.ccs-client-0=0x000000000000000000000000
trusted.afr.ccs-client-2=0x000000000000000000000000
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.volume-id=0x15301766900e4e9fb0b87c6b3f6e90f0

[root@dbm-0:/root]
# getfattr -m . -d -e hex /mnt/bricks/ccs/brick/
getfattr: Removing leading '/' from absolute path names
# file: mnt/bricks/ccs/brick/
trusted.afr.ccs-client-0=0x000000000000000000000004
trusted.afr.ccs-client-1=0x000000000000000000000000
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.volume-id=0x15301766900e4e9fb0b87c6b3f6e90f0
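
Decoded the same way as the services volume xattrs in the description (same assumed data/metadata/entry counter layout): mn-0 blames dbm-0 (trusted.afr.ccs-client-2 entry counter = 2), dbm-0 blames mn-0 (trusted.afr.ccs-client-0 entry counter = 4), and mn-1 blames nobody, i.e. the same mutual-accusation pattern.
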
[2019-07-13 02:05:28.001785] I [MSGID: 108026] [afr-self-heald.c:432:afr_shd_index_heal] 0-ccs-replicate-0: got entry: 00000000-0000-0000-0000-000000000001 from ccs-client-0
[2019-07-13 02:05:28.002066] I [MSGID: 108026] [afr-self-heald.c:341:afr_shd_selfheal] 0-ccs-replicate-0: entry: path /, gfid: 00000000-0000-0000-0000-000000000001
[2019-07-13 02:06:29.001650] I [MSGID: 108026] [afr-self-heald.c:432:afr_shd_index_heal] 0-ccs-replicate-0: got entry: 00000000-0000-0000-0000-000000000001 from ccs-client-0
[2019-07-13 02:06:29.001986] I [MSGID: 108026] [afr-self-heald.c:341:afr_shd_selfheal] 0-ccs-replicate-0: entry: path /, gfid: 00000000-0000-0000-0000-000000000001
[2019-07-13 02:07:30.003468] I [MSGID: 108026] [afr-self-heald.c:432:afr_shd_index_heal] 0-ccs-replicate-0: got entry: 00000000-0000-0000-0000-000000000001 from ccs-client-0
[2019-07-13 02:07:30.004325] I [MSGID: 108026] [afr-self-heald.c:341:afr_shd_selfheal] 0-ccs-replicate-0: entry: path /, gfid: 00000000-0000-0000-0000-000000000001
[2019-07-13 02:08:31.001744] I [MSGID: 108026] [afr-self-heald.c:432:afr_shd_index_heal] 0-ccs-replicate-0: got entry: 00000000-0000-0000-0000-000000000001 from ccs-client-0
[2019-07-13 02:08:31.002067] I [MSGID: 108026] [afr-self-heald.c:341:afr_shd_selfheal] 0-ccs-replicate-0: entry: path /, gfid: 00000000-0000-0000-0000-000000000001
[2019-07-13 02:09:15.002043] I [MSGID: 108026] [afr-self-heald.c:432:afr_shd_index_heal] 0-encryptfile-replicate-0: got entry: 00000000-0000-0000-0000-000000000001 from encryptfile-client-0
[2019-07-13 02:09:15.002679] I [MSGID: 108026] [afr-self-heald.c:341:afr_shd_selfheal] 0-encryptfile-replicate-0: entry: path /, gfid: 00000000-0000-0000-0000-000000000001
[2019-07-13 02:09:32.001682] I [MSGID: 108026] [afr-self-heald.c:432:afr_shd_index_heal] 0-ccs-replicate-0: got entry: 00000000-0000-0000-0000-000000000001 from ccs-client-0
[2019-07-13 02:09:32.002015] I [MSGID: 108026] [afr-self-heald.c:341:afr_shd_selfheal] 0-ccs-replicate-0: entry: path /, gfid: 00000000-0000-0000-0000-000000000001

Comment 4 Karthik U S 2019-08-16 12:34:55 UTC
The volume for which heal info was given in the description and the one given in the last comment are different.
- How many volumes have this problem?
- Please provide the gluster logs to debug this further.

Comment 5 zhou lin 2019-08-17 06:25:42 UTC
Created attachment 1605200 [details]
mn-0 node glustershd log + services brick glusterfsd log

Comment 6 zhou lin 2019-08-17 06:32:35 UTC
This issue does not happen very often, but if we keep running the reboot case there is a chance it will happen.
Both the services volume and the ccs volume have hit this issue, and each time the changelog shows the same pattern: two storage nodes accuse each other and the third one blames nothing.
I studied the glusterfs source code, and it feels like in this case __afr_selfheal_entry_prepare() cannot decide the healed_sinks; I think even with the latest code it would behave the same.

I do not have the logs right now; maybe I can collect them the next time it happens.

Comment 7 Karthik U S 2019-08-26 10:39:27 UTC
Hi Cynthia,

Appreciate your efforts in finding the root cause of this issue. Yes, you are right: in this case __afr_selfheal_entry_prepare() does not set the bricks that need heal as healed_sinks. I reproduced this locally by setting the required xattrs and creating the gfid entries manually on the backend. I will work on the fix for this.

Comment 8 zhou lin 2019-08-27 02:26:51 UTC
Thanks! Looking forward to your fix patch :)!

Comment 9 Hunang Shujun 2019-08-27 05:42:22 UTC
healed_sinks is empty because afr_selfheal_find_direction() does not find any sink. In that function, only a node that is accused by a source node can be decided as a sink; other accused nodes will not be identified as sinks. Is this rule valid? If so, what is the reason?
  for (i = 0; i < priv->child_count; i++) {
          if (!sources[i]) ---> accusations from non-source nodes are not taken into account
                  continue;
          if (self_accused[i])
                  continue;
          for (j = 0; j < priv->child_count; j++) {
                  if (matrix[i][j])
                          sinks[j] = 1;
          }
  }

Comment 10 Karthik U S 2019-08-27 06:59:32 UTC
(In reply to Hunang  Shujun from comment #9)
> healed_sinks is empty because afr_selfheal_find_direction() does not find
> any sink. In that function, only a node that is accused by a source node
> can be decided as a sink; other accused nodes will not be identified as
> sinks. Is this rule valid? If so, what is the reason?
>   for (i = 0; i < priv->child_count; i++) {
>           if (!sources[i]) ---> accusations from non-source nodes are not
>                                 taken into account
>                   continue;
>           if (self_accused[i])
>                   continue;
>           for (j = 0; j < priv->child_count; j++) {
>                   if (matrix[i][j])
>                           sinks[j] = 1;
>           }
>   }

This code is valid. Here a brick is marked as a sink only if it is blamed by one of the non-self-accused source bricks. Then in __afr_selfheal_entry_prepare() we intersect locked_on and sinks to populate healed_sinks. After this __afr_selfheal_entry_finalize_source() is called, which attempts to mark all the bricks which are not sources as healed_sinks.
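
Roughly, the healed_sinks computation in __afr_selfheal_entry_prepare() boils down to the following (paraphrased sketch; the real code performs this intersection through a helper macro):

    for (i = 0; i < priv->child_count; i++)
            healed_sinks[i] = sinks[i] && locked_on[i];

    /* In the case above the only source (dbm-0) blames nobody, so
       afr_selfheal_find_direction() leaves sinks[] all zero and the
       intersection leaves healed_sinks[] all zero as well, even though
       locked_on[] is all ones. */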

    sources_count = AFR_COUNT(sources, priv->child_count);
    if ((AFR_CMP(locked_on, healed_sinks, priv->child_count) == 0) ||
        !sources_count || afr_does_witness_exist(this, witness)) {   -------> these conditions do not hold true in this case, so it fails to mark the non-sources as sinks
        memset(sources, 0, sizeof(*sources) * priv->child_count);
        afr_mark_active_sinks(this, sources, locked_on, healed_sinks);
        return -1;
    }

    source = afr_choose_source_by_policy(priv, sources, AFR_ENTRY_TRANSACTION);
    return source;

We need to handle separately the case where we have a source set but no brick marked as a sink. Since this is happening for entry heal, we cannot directly consider all the other bricks as sinks, which might lead to data loss. So the best approach here is a conservative merge. I will check whether this happens for the data & metadata heal cases as well (ideally it should not) and then send a patch to fix this.
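
A very rough sketch of the kind of handling meant above (hypothetical illustration only, not the actual patch): if a source was chosen but nothing ended up in healed_sinks, drop the source and take all locked bricks as sinks, so that entry heal performs a conservative merge (a union of directory entries, with no deletions) instead of skipping the round.

    if (source >= 0 && AFR_COUNT(healed_sinks, priv->child_count) == 0) {
            memset(sources, 0, sizeof(*sources) * priv->child_count);
            afr_mark_active_sinks(this, sources, locked_on, healed_sinks);
            source = -1;    /* no single source -> conservative merge */
    }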

Comment 11 Hunang Shujun 2019-08-27 07:18:47 UTC
I appreciate your detailed explanation. :)

Comment 12 Karthik U S 2019-10-11 08:34:04 UTC
Hi,

This bug is fixed in the latest master by the patch [1], which is tracked by BZ #1749322. I have done the backports to other maintained branches.
The fix should be part of the next releases. Since the 4.1 branch is no longer maintained I am closing this bug.

[1] https://review.gluster.org/#/c/glusterfs/+/23364/

Regards,
Karthik

Comment 13 zhou lin 2019-10-14 07:14:41 UTC
Good! Thanks!