Bug 1618932

Summary: dht-selfheal.c: Directory selfheal failed
Product: [Community] GlusterFS Reporter: frostyplanet
Component: distributeAssignee: Susant Kumar Palai <spalai>
Status: CLOSED UPSTREAM QA Contact:
Severity: urgent Docs Contact:
Priority: medium    
Version: mainlineCC: bugs, frostyplanet, jahernan, ksubrahm, pasik, pkarampu, ravishankar
Target Milestone: ---Keywords: Triaged
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-03-12 12:49:17 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
gfapi log
none
error log in brick
none
glfsheal log
none
state dump of offline cluster
none
glfs heal trace level log with version 3.12.11
none
glfs heal trace level log with version 3.12.1
none
gfapi log for io error in 3.12.14 none

Description frostyplanet 2018-08-18 12:49:41 UTC
Created attachment 1476762 [details]
gfapi log

Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:

There's mulitple application using gfapi concurrently creating file in the same directory
(e51fd83622674cc9) and (e21ea6832d2b13d0) are log from different application processes.


application log  
-----------------
timezone is GMT+8

2018-08-18 19:35:03,703 DEBUG -31021968- writing to file cluster=4 FS/rt/mbXx/service-log_0/0760dee6406533f5aefa43f83bdd8918_171654947375444628.m/0004_bfab2d1ea2da11e8a3196c92bf5c1b88  (app:1461)(e51fd83622674cc9)
2018-08-18 19:35:03,734 DEBUG -32369552- writing to file cluster=4 FS/rt/mbXx/service-log_0/0760dee6406533f5aefa43f83bdd8918_171654947375444628.m/0001_bfafdf58a2da11e8a3196c92bf5c1b88  (app:1461)(e21ea6832d2b13d0)
2018-08-18 19:35:03,786 DEBUG -31022448- Create new directory [FS/rt/mbXx/service-log_0/0760dee6406533f5aefa43f83bdd8918_171654947375444628.m] on cluster [4] ((unknown file):           0)(e51fd83622674cc9)
2018-08-18 19:35:03,795 CRITICAL -31021968- Failed to open cluster [4] object [FS/rt/mbXx/service-log_0/0760dee6406533f5aefa43f83bdd8918_171654947375444628.m/                           0004_bfab2d1ea2da11e8a3196c92bf5c1b88] with mode [w]: [[Errno 5] Input/output error] (app:1461)(e51fd83622674cc9)
2018-08-18 19:35:03,903 DEBUG -32366672- Directory [FS/rt/mbXx/service-log_0/0760dee6406533f5aefa43f83bdd8918_171654947375444628.m] exists on cluster [4] ((unknown file):               0)(e21ea6832d2b13d0)
2018-08-18 19:35:03,945 DEBUG -32369552- Open cluster [4] file [FS/rt/mbXx/service-log_0/0760dee6406533f5aefa43f83bdd8918_171654947375444628.m/0001_bfafdf58a2da11e8a3196c92bf5c1b88]    with mode [w] (app:1461)(e21ea6832d2b13d0)
2018-08-18 19:35:04,127 DEBUG -31021968- Open cluster [4] file [FS/rt/mbXx/service-log_0/0760dee6406533f5aefa43f83bdd8918_171654947375444628.m/0004_bfab2d1ea2da11e8a3196c92bf5c1b88]    with mode [w] (app:1461)(e51fd83622674cc9)
2018-08-18 19:35:04,391 INFO -32369552- Rename file: cluster=4 src=FS/rt/mbXx/service-log_0/0760dee6406533f5aefa43f83bdd8918_171654947375444628.m/0001_bfafdf58a2da11e8a3196c92bf5c1b88  dst=FS/rt/mbXx/service-log_0/0760dee6406533f5aefa43f83bdd8918_171654947375444628.m/0001 (app:1461)(e21ea6832d2b13d0)
2018-08-18 19:35:04,485 INFO -31021968- Rename file: cluster=4 src=FS/rt/mbXx/service-log_0/0760dee6406533f5aefa43f83bdd8918_171654947375444628.m/0004_bfab2d1ea2da11e8a3196c92bf5c1b88  dst=FS/rt/mbXx/service-log_0/0760dee6406533f5aefa43f83bdd8918_171654947375444628.m/0004 (app:1461)(e51fd83622674cc9)


Actual results:

IO error happended when creating file, success after retry
 
dht-selfheal failure is observed  in gfapi log, there is unmatched inode unlock 
request reported from brick.


Expected results:


Additional info:

"gluster volume status" output is all ok,
but runing "gluster volume heal vol0 info" blocks and no output


gluster volume info
--------------------
 
Volume Name: vol0
Type: Distributed-Replicate
Volume ID: 18e1c05d-570a-4c97-aa91-ef984881c4f2
Status: Started
Snapshot Count: 0
Number of Bricks: 36 x 3 = 108
Transport-type: tcp

Options Reconfigured:
locks.trace: false
client.event-threads: 6
cluster.self-heal-daemon: enable
performance.write-behind: True
transport.keepalive: True
cluster.rebal-throttle: lazy
server.event-threads: 4
performance.io-cache: False
nfs.disable: True
cluster.quorum-type: auto
network.ping-timeout: 120
features.cache-invalidation: False
performance.read-ahead: False
performance.client-io-threads: True
cluster.server-quorum-type: none
performance.md-cache-timeout: 0
performance.readdir-ahead: True

Comment 1 frostyplanet 2018-08-18 12:52:01 UTC
production cluster recently upgraded from 3.7.6 to gluster-3.10.12

Comment 2 frostyplanet 2018-08-18 13:00:04 UTC
Created attachment 1476763 [details]
error log in brick

no lock found when doing inode unlock, it seems unblock request was from dht self-heal

Comment 3 frostyplanet 2018-08-18 13:12:37 UTC
Created attachment 1476765 [details]
glfsheal log

there was many "has not responded in the last 120 seconds, disconnecting." in it

Comment 4 frostyplanet 2018-08-20 04:47:39 UTC
Input/output seen on 0.01% operations, reproduce in every hour

Comment 5 frostyplanet 2018-08-20 04:52:59 UTC
*** Bug 1618925 has been marked as a duplicate of this bug. ***

Comment 6 frostyplanet 2018-08-20 04:56:16 UTC
Host eviroment
Centos 7.2
3.10.0-327.18.2.el7.x86_64 #1 SMP x86_64 x86_64 x86_64 GNU/Linux

Comment 7 frostyplanet 2018-08-20 13:38:32 UTC
I have another offline storage cluster to reproduce the "gluster heal info" hanging problem. "killall glusterd glusterfsd" then restart glusterd did not fix it.

I have triggered state dump, attach them later.

Comment 8 frostyplanet 2018-08-20 13:42:04 UTC
Created attachment 1477213 [details]
state dump of offline cluster

Another offline cluster which upgraded from 3.7.6 to 3.10.12. "gluster volum heal info hang" after stress test. These are state dump file.

Comment 9 frostyplanet 2018-08-21 08:53:56 UTC
gdb on "glfsheal vol0" command, stuck at heal/src/glfs-heal.c:1401

Comment 10 frostyplanet 2018-08-21 12:15:37 UTC
upgraded to 3.12.11, "glfsheal vol0" stuck problem still persist

Comment 11 frostyplanet 2018-09-14 22:46:12 UTC
I have revert the following two commit in release-3.12 branch to get "gluster volume heal vol0 info" working properly.

It may lead to blocking client.event-thread. The situation was described in

https://lists.gluster.org/pipermail/gluster-users/2018-September/034871.html
https://lists.gluster.org/pipermail/gluster-users/2018-September/034876.html

commit 086f1d088e257a3abce68a1807692c36172a8c0f
Author: Xavier Hernandez <jahernan>
Date:   Thu Feb 1 16:06:32 2018 +0100

    cluster/afr: remove unnecessary child_up initialization
    
    The child_up array was initialized with all elements being -1 to
    allow afr_notify() to differentiate down bricks from bricks that
    haven't reported yet. With current implementation this is not needed
    anymore and it was causing unexpected results when other parts of
    the code considered that if child_up[i] != 0, it meant that it was up.
    
    Backport of:
    > BUG: 1541038
    
    Change-Id: I2a9d712ee64c512f24bd5cd3a48dcb37e3139472
    BUG: 1541930
    Signed-off-by: Xavier Hernandez <jahernan>


commit 94faf8c4a1f5a020593ca65f44d68e1ba2f632eb
Author: karthik-us <ksubrahm>
Date:   Wed Sep 20 18:30:21 2017 +0530

    cluster/afr: Sending subvol up/down events when subvol comes up or goes down
    
    > BUG: 1493539
    
    (cherry picked from commit 3bbb4fe4b33dc3a3ed068ed2284077f2a4d8265a)
    
    Change-Id: I6580351b245d5f868e9ddc6a4eb4dd6afa3bb6ec
    BUG: 1492066
    Signed-off-by: karthik-us <ksubrahm>

Comment 12 Pranith Kumar K 2018-09-20 07:20:25 UTC
hi,
     I went through the logs, as you pointed out, issue seems to be the ping-timeout. It is not clear as to why it is happening though. Ping timeout generally happens when the brick doesn't respond in ping-timeout seconds, which it looks to be configured as 120 seconds. Do you have this problem only with heal info command or the mount's I/O is also hanging (as per your bz description it doesn't look like it, but want a confirmation)?

If it only happens with heal info, will it be possible for you to help find the RC? we need to check if there is any dead-lock in the heal-info process.

Pranith

Comment 13 frostyplanet 2018-09-20 12:26:27 UTC
It only happend in "heal info", and always reproduciable.
"gluster volume status" shows all bricks were online. 

Mount by fuse were instant. Write throught gfapi are OK most of the time. 
Except there were very rare case that gfapi write operation return after blocking several hours.

It turned out that "heal info" problem was not related to IO error problem. Our heals are all normal for now (reverted the two commit I pointed out earlier).
IO error problem still reproduciable for the moment.

Comment 14 frostyplanet 2018-09-20 12:39:59 UTC
>> will it be possible for you to help find the RC?

yes, if there are future patches I'm happy to test them out

Comment 15 frostyplanet 2018-09-20 12:49:34 UTC
Created attachment 1485128 [details]
glfs heal trace level log with version 3.12.11

using v3.12.11, change log to trace level, with some attitional debug log added.

Comment 16 frostyplanet 2018-09-20 12:50:41 UTC
Created attachment 1485129 [details]
glfs heal trace level log with version 3.12.1

with version 3.12.1, sucessful run

Comment 17 frostyplanet 2018-09-20 12:59:43 UTC
(In reply to frostyplanet from comment #13)

> Except there were very rare case that gfapi write operation return after
> blocking several hours.

Ignore this, I checked the log it was a bug not related to gluster

Comment 18 frostyplanet 2018-09-20 14:00:45 UTC
Previously in my debuging with v3.12.11, after setting "network.ping-timeout" to 0, the client side behaviour is looping in reconnecting to all the bricks, success after a long time.  If set client.event-thread to 24 (equal to number of bricks), heal info success in 30 seconds. So it leads to my suspect.
 Afterwards I used git bisect to locate commit 94faf8c4a1 and  086f1d08 were the cause.

commit 94faf8c4a1 added two events to fire, but EVENT_AFR_SUBVOL_UP/EVENT_AFR_SUBVOLS_DOWN are not processed in any where of the code.
 
Commit 086f1d08 removed some variable initialization code, seams a regression.

Comment 19 frostyplanet 2018-09-26 22:48:27 UTC
Created attachment 1487508 [details]
gfapi log for io error in 3.12.14

New sample of gfapi log in version 3.12.14. IO error happended while creating file

Comment 20 Shyamsundar 2018-10-23 14:55:03 UTC
Release 3.12 has been EOLd and this bug was still found to be in the NEW state, hence moving the version to mainline, to triage the same and take appropriate actions.

Comment 21 Pranith Kumar K 2018-10-29 08:51:52 UTC
Nitya,
     As per the log in comment-19, dht_create() is failing as it couldn't find any subvolume for doing create I think. When can this happen?

Comment 22 Nithya Balachandran 2018-10-29 11:29:28 UTC
If multiple clients are creating the same directory, it is possible that one client gets an incomplete layout as the mkdir is not complete. It should, however, heal it itself. If for some reason it could not heal the directory we could run into this issue.

Can you set  client-log-level to DEBUG and send us the logs once you hit this? Or better yet, is there a test program I can use to try this out?

Comment 23 Nithya Balachandran 2018-10-29 12:20:23 UTC
(In reply to Nithya Balachandran from comment #22)
> If multiple clients are creating the same directory, it is possible that one
> client gets an incomplete layout as the mkdir is not complete. It should,
> however, heal it itself. If for some reason it could not heal the directory
> we could run into this issue.
> 
> Can you set  client-log-level to DEBUG and send us the logs once you hit
> this? Or better yet, is there a test program I can use to try this out?

Just noticed that trace logs are already provided. I will take a look and get back.

Comment 24 Ravishankar N 2018-11-20 08:34:41 UTC
Changing the component to DHT as a part of triaging. Please assign back to replicate if you believe there is something that needs investigation in AFR.

Comment 25 Nithya Balachandran 2019-11-04 06:30:37 UTC
(In reply to Nithya Balachandran from comment #23)
> (In reply to Nithya Balachandran from comment #22)
> > If multiple clients are creating the same directory, it is possible that one
> > client gets an incomplete layout as the mkdir is not complete. It should,
> > however, heal it itself. If for some reason it could not heal the directory
> > we could run into this issue.
> > 
> > Can you set  client-log-level to DEBUG and send us the logs once you hit
> > this? Or better yet, is there a test program I can use to try this out?
> 
> Just noticed that trace logs are already provided. I will take a look and
> get back.

The logs indicate that the client version is 3.10.
[2018-08-18 11:35:03.685415] D [write-behind.c:1696:wb_process_queue] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7fc082b0ffab] (--> /usr/lib64/glusterfs/3.10.12/xlator/performance/write-behind.so(+0x8adb)[0x7fc073b62adb] (--> /usr/lib64/glusterfs/3.10.12/xlator/performance/write-behind.so(+0xb880)[0x7fc073b65880] (--> /lib64/libglusterfs.so.0(default_lookup+0xbd)[0x7fc082b893bd] (--> /usr/lib64/glusterfs/3.10.12/xlator/performance/quick-read.so(+0x2596)[0x7fc07374c596] ))))) 0-vol0-write-behind: processing queues

This could be because of https://bugzilla.redhat.com/show_bug.cgi?id=1455104 which was fixed in release 3.12. 

Do you still see the I/O errors with  3.12 or later?

Comment 26 Nithya Balachandran 2019-11-04 07:50:00 UTC
(In reply to frostyplanet from comment #19)
> Created attachment 1487508 [details]
> gfapi log for io error in 3.12.14
> 
> New sample of gfapi log in version 3.12.14. IO error happended while
> creating file

I do not see IO errors in the gfapi log. Please provide debug/trace logs when the issue is seen in 3.12

Comment 27 Worker Ant 2020-03-12 12:49:17 UTC
This bug is moved to https://github.com/gluster/glusterfs/issues/957, and will be tracked there from now on. Visit GitHub issues URL for further details

Comment 28 Red Hat Bugzilla 2023-09-14 04:33:19 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days