1760699 – glustershd cannot determine healed_sinks and skips repair, so some entries linger in volume heal info

Keywords:
Status:	CLOSED NEXTRELEASE
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	replicate
Sub Component:
Version:	7
Hardware:	Unspecified
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Karthik U S
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:	1749322
Blocks:

Reported:	2019-10-11 06:48 UTC by Karthik U S
Modified:	2019-11-14 13:21 UTC
CC List:	6 users
Fixed In Version:
Clone Of:	1749322
Environment:
Last Closed:	2019-11-14 13:21:07 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Gluster.org Gerrit	23541	0	None	Merged	cluster/afr: Heal entries when there is a source & no healed_sinks	2019-11-14 13:21:05 UTC

Description Karthik U S 2019-10-11 06:48:56 UTC

+++ This bug was initially created as a clone of Bug #1749322 +++

+++ This bug was initially created as a clone of Bug #1740968 +++

Description of problem:
[root@mn-0:/home/robot]
# gluster v heal services info
Brick mn-0.local:/mnt/bricks/services/brick
/db/upgrade 
Status: Connected
Number of entries: 1

Brick mn-1.local:/mnt/bricks/services/brick
/db/upgrade 
Status: Connected
Number of entries: 1

Brick dbm-0.local:/mnt/bricks/services/brick
Status: Connected
Number of entries: 0

These entries keep showing up in the output of the gluster v heal info command.
According to the glustershd log, each time glustershd processes this entry, nothing is actually done. According to gdb, shd cannot determine the healed_sinks, so every round of repair is a no-op.



[Env info]
Three bricks: mn-0, mn-1, dbm-0
[root@mn-1:/mnt/bricks/services/brick/db/upgrade]
# gluster v info services

Volume Name: services
Type: Replicate
Volume ID: 062748ce-0876-46f6-9936-d9ff3a2b110a
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: mn-0.local:/mnt/bricks/services/brick
Brick2: mn-1.local:/mnt/bricks/services/brick
Brick3: dbm-0.local:/mnt/bricks/services/brick
Options Reconfigured:
cluster.heal-timeout: 60
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
cluster.server-quorum-type: none
cluster.quorum-type: auto
cluster.quorum-reads: true
cluster.consistent-metadata: on
server.allow-insecure: on
network.ping-timeout: 42
cluster.favorite-child-policy: mtime
client.ssl: on
server.ssl: on
ssl.private-key: /var/opt/nokia/certs/glusterfs/glusterfs.key
ssl.own-cert: /var/opt/nokia/certs/glusterfs/glusterfs.pem
ssl.ca-list: /var/opt/nokia/certs/glusterfs/glusterfs.ca
cluster.server-quorum-ratio: 51%

[debug info]
[root@mn-0:/mnt/bricks/services/brick/db]
# getfattr -m . -d -e hex upgrade/
# file: upgrade/
system.posix_acl_access=0x0200000001000700ffffffff04000500ffffffff08000700d302000008000700d402000010000700ffffffff20000500ffffffff
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.services-client-1=0x000000000000000000000015
trusted.afr.services-client-2=0x000000000000000000000000
trusted.gfid=0xf9ebed9856fb4e26987c3a890ed5203c
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
[root@mn-1:/mnt/bricks/services/brick/db/upgrade]
# getfattr -m . -d -e hex .
# file: .
system.posix_acl_access=0x0200000001000700ffffffff04000500ffffffff08000700d302000008000700d402000010000700ffffffff20000500ffffffff
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.services-client-0=0x000000000000000000000003
trusted.afr.services-client-2=0x000000000000000000000000
trusted.gfid=0xf9ebed9856fb4e26987c3a890ed5203c
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
[root@dbm-0:/mnt/bricks/services/brick/db/upgrade]
# getfattr -m . -d -e hex .
# file: .
system.posix_acl_access=0x0200000001000700ffffffff04000500ffffffff08000700d302000008000700d402000010000700ffffffff20000500ffffffff
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.services-client-0=0x000000000000000000000000
trusted.afr.services-client-1=0x000000000000000000000000
trusted.gfid=0xf9ebed9856fb4e26987c3a890ed5203c
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
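
For reference, the trusted.afr.* values above can be decoded with a short script. This is a minimal sketch that assumes the usual AFR changelog layout (three big-endian 32-bit counters for pending data, metadata and entry operations, in that order) and that services-client-0, -1 and -2 correspond to Brick1, Brick2 and Brick3 from the volume info. decode_afr_pending is a hypothetical helper name, not part of GlusterFS.

```python
import struct

def decode_afr_pending(hex_value):
    """Split a 12-byte trusted.afr.* value into (data, metadata, entry).

    Assumption: three unsigned 32-bit counters in big-endian byte order.
    """
    text = hex_value[2:] if hex_value.startswith("0x") else hex_value
    raw = bytes.fromhex(text)
    if len(raw) != 12:
        raise ValueError("expected 12 bytes, got %d" % len(raw))
    return struct.unpack(">III", raw)

# Non-zero values copied from the getfattr output in this report.
# Key: (brick holding the xattr, client name it refers to).
xattrs = {
    ("mn-0", "services-client-1"): "0x000000000000000000000015",
    ("mn-1", "services-client-0"): "0x000000000000000000000003",
}

decoded = {key: decode_afr_pending(value) for key, value in xattrs.items()}

for (brick, client), (data, metadata, entry) in decoded.items():
    print("%s -> %s: data=%d metadata=%d entry=%d"
          % (brick, client, data, metadata, entry))
```

Under those assumptions, mn-0 holds 21 pending entry operations against mn-1, mn-1 holds 3 against mn-0, and dbm-0 blames nobody.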

gdb attached to the glustershd process on mn-0:
Thread 14 "glustershdheal" hit Breakpoint 10, __afr_selfheal_entry_prepare (frame=frame@entry=0x7f54840321e0, this=this@entry=0x7f548c016980, 
    inode=<optimized out>, locked_on=locked_on@entry=0x7f545effc780 "\001\001\001dT\177", sources=sources@entry=0x7f545effc7c0 "", 
    sinks=sinks@entry=0x7f545effc7b0 "", healed_sinks=<optimized out>, replies=<optimized out>, source_p=<optimized out>, pflag=<optimized out>)
    at afr-self-heal-entry.c:546
546 in afr-self-heal-entry.c
(gdb) print heald_sinks[0]
No symbol "heald_sinks" in current context.
(gdb) print healed_sinks[0]
value has been optimized out
(gdb) print source
$12 = 2
(gdb) print sinks[0]
$13 = 0 '\000'
(gdb) print sinks[1]
$14 = 0 '\000'
(gdb) print sinks[2]
$15 = 0 '\000'
(gdb) print locked_on[0]
$16 = 1 '\001'
(gdb) print locked_on[1]
$17 = 1 '\001'
(gdb) print locked_on[2]
$18 = 1 '\001'

According to the code in __afr_selfheal_entry, on every heal attempt healed_sinks is all zeros, so the check "if (AFR_COUNT(healed_sinks, priv->child_count) == 0)" jumps to unlock and skips this round of heal. As a result, /db/upgrade keeps showing up in the output of the "volume heal info" command. It seems the current glustershd code does not handle this situation, but an entry that lingers indefinitely is not ideal.
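
The situation can be reproduced in a toy model. The sketch below is not the GlusterFS implementation: find_direction and the rule it applies (a brick is a source if no other brick blames it, and a sink only if a non-accused brick blames it) are simplifying assumptions made for illustration. The pending counts are the trusted.afr values from the debug info, read as entry counters.

```python
def find_direction(pending):
    """Toy direction finding over a blame matrix.

    pending[i][j] is the number of pending entry operations that
    brick i records against brick j. Assumed rule, for illustration
    only: a brick blamed by any other brick is "accused"; sources are
    the non-accused bricks; sinks are bricks blamed by a non-accused
    brick.
    """
    n = len(pending)
    accused = [any(pending[i][j] for i in range(n) if i != j)
               for j in range(n)]
    sources = [not a for a in accused]
    sinks = [any(pending[i][j] for i in range(n)
                 if i != j and not accused[i])
             for j in range(n)]
    return sources, sinks

# Rows and columns: mn-0, mn-1, dbm-0.
pending_entry = [
    [0, 21, 0],  # mn-0 blames mn-1
    [3, 0, 0],   # mn-1 blames mn-0
    [0, 0, 0],   # dbm-0 blames nobody
]

sources, sinks = find_direction(pending_entry)
healed_sinks = list(sinks)

# Mirrors the quoted check: with no healed sinks, the round is skipped.
skip_heal = sum(healed_sinks) == 0

print("sources:", sources)
print("sinks:", sinks)
print("skip heal:", skip_heal)
```

In this model mn-0 and mn-1 accuse each other, so only dbm-0 (index 2) is a source, which matches the "print source" value in the gdb session. Because the only non-accused brick blames nobody, no sink is marked and the heal round is skipped even though a usable source exists.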
Any idea how to improve this?

Comment 1 Worker Ant 2019-10-11 06:54:39 UTC

REVIEW: https://review.gluster.org/23541 (cluster/afr: Heal entries when there is a source & no healed_sinks) posted (#1) for review on release-7 by Karthik U S

Comment 2 Worker Ant 2019-11-14 13:21:07 UTC

REVIEW: https://review.gluster.org/23541 (cluster/afr: Heal entries when there is a source & no healed_sinks) merged (#2) on release-7 by hari gowtham
