Bug 1159484

Summary: ls -alR cannot heal the disperse volume
Product: [Community] GlusterFS
Reporter: lidi <lidi>
Component: disperse
Assignee: Xavi Hernandez <jahernan>
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: unspecified
Version: 3.6.0
CC: gluster-bugs, jahernan, lmohanty
Target Milestone: ---
Target Release: ---
Keywords: Triaged
Hardware: Unspecified
OS: Unspecified
Fixed In Version: glusterfs-3.6.2
Doc Type: Bug Fix
Clones: 1161588 (view as bug list)
Last Closed: 2015-01-28 14:28:01 UTC
Type: Bug
Bug Depends On: 1161588
Bug Blocks: 1163723

Description lidi 2014-11-01 08:49:51 UTC
Steps to Reproduce:
1. gluster vol create test disperse 3 redundancy 1 10.10.21.20:/sdb 10.10.21.21:/sdb 10.10.21.22:/sdb force
2. start the volume and mount it on /cluster2/test
3. cd /cluster2/test
4. mkdir a b c
5. touch a/1 b/2 c/3
6. gluster vol replace-brick test 10.10.21.22:/sdb 10.10.21.23:/sdb commit force
7. on 10.10.21.20, execute 'ls -alR /cluster2/test'
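
For reference, "disperse 3 redundancy 1" means each block is stored as 2 data fragments plus 1 redundancy fragment, so the volume survives the loss of any single brick — which is why a replaced brick leaves every file with one recoverable missing fragment. A minimal sketch of the idea using XOR parity (illustrative only; the actual ec xlator uses Reed-Solomon coding, and encode/decode here are hypothetical helpers):

```python
def encode(block: bytes):
    # Split a block into 2 data fragments plus 1 XOR parity fragment,
    # the layout a "disperse 3 redundancy 1" volume spreads across its
    # 3 bricks (one fragment per brick).
    if len(block) % 2:
        block += b"\x00"  # pad to an even length
    half = len(block) // 2
    d0, d1 = block[:half], block[half:]
    parity = bytes(a ^ b for a, b in zip(d0, d1))
    return [d0, d1, parity]

def decode(fragments):
    # Reconstruct the block from any 2 of the 3 fragments;
    # a lost fragment (e.g. on a replaced brick) is None.
    d0, d1, parity = fragments
    if d0 is None:
        d0 = bytes(a ^ b for a, b in zip(d1, parity))
    elif d1 is None:
        d1 = bytes(a ^ b for a, b in zip(d0, parity))
    return d0 + d1

fragments = encode(b"hello!")
fragments[2] = None            # the replaced brick's fragment is gone
assert decode(fragments) == b"hello!"
```

Self-heal's job is then simply to recompute and write back the fragment that the new, empty brick is missing.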

Actual results:
All directories are healed, but none of the files are.
I then tried read/write/getattr on these files and found that only write heals them.

Comment 1 Xavi Hernandez 2014-11-04 12:28:24 UTC
Are you using 3.6.0beta3, the official 3.6.0, or a build from source (if so, from which commit)?

Comment 2 lidi 2014-11-05 02:05:38 UTC
Both 3.6.0-beta3 and the official 3.6.0 have the same problem.

Comment 3 Anand Avati 2014-11-07 12:46:18 UTC
REVIEW: http://review.gluster.org/9073 (ec: Fix self-healing issues.) posted (#1) for review on release-3.6 by Xavier Hernandez (xhernandez)

Comment 4 Xavi Hernandez 2014-11-07 12:52:24 UTC
This change should solve the problem.

Self-heal is normally executed in the background. This can result in files not being healed when expected. To force a full self-heal, use this command:

    find <mount point> -d -exec getfattr -h -n trusted.ec.heal {} \;

This command heals all files in the foreground, so it guarantees that all files have been healed when it finishes.
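
The same foreground heal can be scripted directly; here is a sketch assuming a FUSE mount of the volume, where reading the trusted.ec.heal virtual xattr is what triggers the synchronous heal (heal_tree is a hypothetical helper, not part of any Gluster tooling):

```python
import os

def heal_tree(mount_point, heal=None):
    # Walk the tree bottom-up (the equivalent of `find -d`) and request
    # a synchronous self-heal of every entry, children before their
    # parent directory, finishing with the mount point itself.
    if heal is None:
        def heal(path):
            try:
                # The disperse xlator intercepts this getxattr and runs
                # a foreground self-heal before answering.
                os.getxattr(path, "trusted.ec.heal", follow_symlinks=False)
            except OSError:
                pass  # not a disperse mount, or the entry vanished
    for root, dirs, files in os.walk(mount_point, topdown=False):
        for name in files + dirs:
            heal(os.path.join(root, name))
    heal(mount_point)
```

The bottom-up order matters for the same reason `find -d` is used: a file can only be healed once its parent directory's entry exists on the replaced brick, or — as the eventual fix does — the parent must be healed first.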

Comment 5 lidi 2014-11-11 02:40:53 UTC
I've tried this patch and hit a new problem: the client hangs.

Steps to Reproduce:
1. gluster vol create test disperse 3 redundancy 1 10.10.21.20:/sdb 10.10.21.21:/sdb 10.10.21.22:/sdb force
2. start the volume and mount it on /cluster2/test
3. cd /cluster2/test
4. mkdir a b c
5. touch a/1 b/2 c/3
6. "vi a/1" and write something to it
7. enter ":wq" to save and quit

Actual results:
vi hangs, and any operations from another console hang too.

Comment 6 Xavi Hernandez 2014-11-11 08:24:46 UTC
This is caused by another bug (#1159471). It's probably also related to #1159529. I'm currently working on them.

Comment 7 lidi 2014-11-11 09:12:00 UTC
Yes, you are right. I can reproduce it with release-3.6.

But now I've hit another problem.

1. gluster vol create test disperse 3 redundancy 1 10.10.21.20:/sdb 10.10.21.21:/sdb 10.10.21.22:/sdb force
2. start the volume and mount it on /cluster2/test
3. cd /cluster2/test
4. mkdir a b c
5. touch a/1 b/2 c/3
6. gluster vol replace-brick test 10.10.21.22:/sdb 10.10.21.23:/sdb commit force
7. on one console, attach gdb to the client process; on another console, execute
   "find /cluster2/test -d -exec getfattr -h -n trusted.ec.heal {} \;"
8. with breakpoints set at ec_getxattr and ec_manager_heal, after executing "continue" several times the program received SIGSEGV.

Additional info:

[root@localhost /]# gdb /usr/sbin/glusterfs core.1851 
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-60.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/glusterfs...done.
...
Core was generated by `/usr/sbin/glusterfs --volfile-server=127.0.0.1 --volfile-id=/test /cluster2/tes'.
Program terminated with signal 11, Segmentation fault.
#0  0x000000000186f840 in ?? ()
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.107.el6.x86_64 keyutils-libs-1.4-4.el6.x86_64 krb5-libs-1.10.3-10.el6.x86_64 libcom_err-1.41.12-14.el6.x86_64 libgcc-4.4.7-3.el6.x86_64 libselinux-2.0.94-5.3.el6.x86_64 openssl-1.0.0-27.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0  0x000000000186f840 in ?? ()
#1  0x00007ff69ebab0f5 in ec_getxattr_heal_cbk (frame=0x186ad2c, cookie=0x1869efc, xl=0x1840700, op_ret=-1, op_errno=5, mask=0, good=0, bad=0, xdata=0x0) at ec-inode-read.c:402
#2  0x00007ff69ebfa7bf in ec_manager_heal (fop=0x1869efc, state=-4) at ec-heal.c:1482
#3  0x00007ff69eb949fc in __ec_manager (fop=0x1869efc, error=5) at ec-common.c:1478
#4  0x00007ff69eb91e6d in ec_resume (fop=0x1869efc, error=5) at ec-common.c:312
#5  0x00007ff69ebf9dee in ec_heal_dispatch (heal=0x186f840) at ec-heal.c:1219
#6  0x00007ff69ebfa552 in ec_manager_heal (fop=0x186f14c, state=-218) at ec-heal.c:1417
#7  0x00007ff69eb949fc in __ec_manager (fop=0x186f14c, error=5) at ec-common.c:1478
#8  0x00007ff69eb91e6d in ec_resume (fop=0x186f14c, error=0) at ec-common.c:312
#9  0x00007ff69eb91eba in ec_resume_parent (fop=0x186d29c, error=0) at ec-common.c:326
#10 0x00007ff69eb8ee7d in ec_fop_data_release (fop=0x186d29c) at ec-data.c:249
#11 0x00007ff69eb91ff7 in ec_complete (fop=0x186d29c) at ec-common.c:366
#12 0x00007ff69eb9afab in ec_entrylk_cbk (frame=0x186e014, cookie=0x1, this=0x1840700, op_ret=-1, op_errno=107, xdata=0x0) at ec-locks.c:189
#13 0x00007ff69ee2b00e in client3_3_entrylk_cbk (req=0x183ffbc, iov=0x7fffec68ffe0, count=1, myframe=0x1842a7c) at client-rpc-fops.c:1624
#14 0x00007ff6aadd4e2c in saved_frames_unwind (saved_frames=0x1834890) at rpc-clnt.c:366
#15 0x00007ff6aadd4ecb in saved_frames_destroy (frames=0x1834890) at rpc-clnt.c:383
#16 0x00007ff6aadd533f in rpc_clnt_connection_cleanup (conn=0x1863460) at rpc-clnt.c:533
#17 0x00007ff6aadd5d3e in rpc_clnt_notify (trans=0x18637e0, mydata=0x1863460, event=RPC_TRANSPORT_DISCONNECT, data=0x18637e0) at rpc-clnt.c:840
#18 0x00007ff6aadd2544 in rpc_transport_notify (this=0x18637e0, event=RPC_TRANSPORT_DISCONNECT, data=0x18637e0) at rpc-transport.c:516
#19 0x00007ff6a0660b31 in socket_event_poll_err (this=0x18637e0) at socket.c:1160
#20 0x00007ff6a06654a6 in socket_event_handler (fd=12, idx=1, data=0x18637e0, poll_in=1, poll_out=0, poll_err=16) at socket.c:2345
#21 0x00007ff6ab074dfb in event_dispatch_epoll_handler (event_pool=0x17e8510, events=0x1808410, i=2) at event-epoll.c:384
#22 0x00007ff6ab074ff5 in event_dispatch_epoll (event_pool=0x17e8510) at event-epoll.c:445
#23 0x00007ff6ab04221b in event_dispatch (event_pool=0x17e8510) at event.c:113
#24 0x0000000000409790 in main (argc=4, argv=0x7fffec691968) at glusterfsd.c:2043

Comment 8 Xavi Hernandez 2014-11-17 10:03:30 UTC
There was a bug in the error path of self-heal. I've fixed it and will update the patch. However, I haven't seen any self-heal error when reproducing your steps. Have you forced an error somehow?

Comment 9 lidi 2014-11-17 10:38:01 UTC
I stepped through the source before executing "continue"; maybe the connection timed out.

Comment 10 Anand Avati 2014-11-17 11:14:21 UTC
REVIEW: http://review.gluster.org/9073 (ec: Fix self-healing issues.) posted (#2) for review on release-3.6 by Xavier Hernandez (xhernandez)

Comment 11 Anand Avati 2014-12-19 10:58:46 UTC
REVIEW: http://review.gluster.org/9073 (ec: Fix self-healing issues.) posted (#3) for review on release-3.6 by Xavier Hernandez (xhernandez)

Comment 12 Anand Avati 2014-12-21 16:04:27 UTC
COMMIT: http://review.gluster.org/9073 committed in release-3.6 by Raghavendra Bhat (raghavendra) 
------
commit d2bde251a93968a03b7b49b3127f29ebdce73b13
Author: Xavier Hernandez <xhernandez>
Date:   Fri Nov 7 12:12:19 2014 +0100

    ec: Fix self-healing issues.
    
    Three problems have been detected:
    
    1. Self healing is executed in background, allowing the fop that
       detected the problem to continue without blocks nor delays.
    
       While this is quite interesting to avoid unnecessary delays,
       it can cause spurious failures of self-heal because it may
       try to recover a file inside a directory that a previous
       self-heal has not recovered yet, causing the file self-heal
       to fail.
    
    2. When a partial self-heal is being executed on a directory,
       if a full self-heal is attempted, it won't be executed
       because another self-heal is already in process, so the
       directory won't be fully repaired.
    
    3. Information contained in loc's of some fop's is not enough
       to do a complete self-heal.
    
    To solve these problems, I've made some changes:
    
    * Improved ec_loc_from_loc() to add all available information
      to a loc.
    
    * Before healing an entry, its parent is checked and partially
      healed if necessary to avoid failures.
    
    * All heal requests received for the same inode while another
      self-heal is being processed are queued. When the first heal
      completes, all pending requests are answered using the results
      of the first heal (without full execution), unless the first
      heal was a partial heal. In this case all partial heals are
      answered, and the first full heal is processed normally.
    
    * A special virtual xattr (not physically stored on bricks)
      named 'trusted.ec.heal' has been created to allow synchronous
      self-heal of files.
    
      Now, the recommended way to heal an entire volume is this:
    
        find <mount> -d -exec getfattr -h -n trusted.ec.heal {} \;
    
    Some minor changes:
    
    * ec_loc_prepare() has been renamed to ec_loc_update().
    
    * All loc management functions return 0 on success and -1 on
      error.
    
    * Do not delay fop unlocks if heal is needed.
    
    * Added basic ec xattrs initially on create, mkdir and mknod
      fops.
    
    * Some coding style changes
    
    This is a backport of http://review.gluster.org/9072/
    
    Change-Id: I2a5fd9c57349a153710880d6ac4b1fa0c1475985
    BUG: 1159484
    Signed-off-by: Xavier Hernandez <xhernandez>
    Reviewed-on: http://review.gluster.org/9073
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Raghavendra Bhat <raghavendra>
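
The queueing scheme described in the commit message can be modelled in a few lines. This is an illustrative, synchronous toy (the real ec xlator is asynchronous C code; InodeHealQueue and run_heal are invented names): the first heal for an inode executes; requests arriving meanwhile are queued and answered with its result, except that full heals queued behind a partial heal get one real full heal of their own.

```python
class InodeHealQueue:
    def __init__(self, run_heal):
        self.run_heal = run_heal    # run_heal(inode, partial) -> result
        self.queues = {}            # inode -> [(partial, callback), ...]

    def request(self, inode, partial, callback):
        if inode in self.queues:
            # A heal for this inode is already in flight: queue behind it.
            self.queues[inode].append((partial, callback))
            return
        self._run(inode, partial, callback)

    def _run(self, inode, partial, callback, queued=()):
        self.queues[inode] = list(queued)
        result = self.run_heal(inode, partial)
        callback(result)
        pending = self.queues.pop(inode)
        if not partial:
            # A full heal's result answers every request queued behind it.
            for _, cb in pending:
                cb(result)
            return
        # A partial heal answers only the queued partial requests ...
        fulls = [(p, cb) for p, cb in pending if not p]
        for p, cb in pending:
            if p:
                cb(result)
        if fulls:
            # ... while the first queued full heal executes normally and
            # the remaining full heals queue behind it, reusing its result.
            (_, first_cb), rest = fulls[0], fulls[1:]
            self._run(inode, False, first_cb, rest)
```

Under this scheme an `ls -alR` that triggers many overlapping heal requests for the same inode costs at most one partial plus one full heal, instead of failing or repeating work.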

Comment 13 Raghavendra Bhat 2015-02-11 09:11:42 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.6.2, please reopen this bug report.

glusterfs-3.6.2 has been announced on the Gluster Developers mailing list [1]; packages for several distributions should already be available or will become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

The fix for this bug is likely to be included in all future GlusterFS releases, i.e. releases > 3.6.2.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/5978
[2] http://news.gmane.org/gmane.comp.file-systems.gluster.user