Bug 1034085 - Self-heal errors with "afr crawl failed for child 0 with ret -1" while performing rolling upgrade.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: unclassified
Version: pre-release
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Anuradha
QA Contact:
URL:
Whiteboard:
Depends On: 1020995
Blocks: 1112111
 
Reported: 2013-11-25 08:58 UTC by Anuradha
Modified: 2016-09-20 02:00 UTC
CC List: 13 users

Fixed In Version: glusterfs-3.6.0beta1
Doc Type: Bug Fix
Doc Text:
Clone Of: 1020995
Clones: 1112111
Environment:
Last Closed: 2014-11-11 08:24:48 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:



Description Anuradha 2013-11-25 08:58:12 UTC
+++ This bug was initially created as a clone of Bug #1020995 +++

Description of problem:
Testing the upgrade scenario specified in bug 1009151#c3. I set up three 2x2 volumes[1] and ran FOPs[2] from multiple clients, then took one node down by stopping glusterd and pkill'ing the glusterfs processes. Post upgrade, when I started glusterd, self-heal triggered as expected (note that I have only ~35G of data across all 3 volumes). I saw self-heal success messages[3] for an hour or so while continuing FOPs[2] from the clients, and then saw self-heal error messages[4]. Only the vol heal info, vol status, and vol info commands were executed during this entire activity.


Version-Release number of selected component (if applicable):
3 nodes: glusterfs-server-3.4.0.33rhs-1.el6rhs.x86_64
1 node: upgraded to glusterfs-server-3.4.0.35rhs-1.el6rhs.x86_64

How reproducible: Hit on the very first attempt while testing a rolling upgrade.


Steps to Reproduce:
1. Create 3 2x2 volumes.
2. Mount them on multiple clients.
3. Do FOPS[2] and ensure there is around 25-30G of data.
4. Bring down node4 (glusterd stop, pkill glusterfs processes) as you continue to do FOPS.
5. Perform upgrade activity. (yum update)
6. Reboot node4.
7. service glusterd start (since I chkconfig'ed it off before reboot).
8. Observe /var/log/glusterfs/glustershd.log log messages.

Actual results:
Initially self-heal is successful; after an hour or so it starts erroring.


Expected results: self-heal should keep running successfully.


Additional info:
[1]
[root@ninja-rhs2 ~]# gluster vol info
 
Volume Name: existing-files
Type: Distributed-Replicate
Volume ID: d15fd04b-5664-4eb4-8ad1-1dfb986131c3
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.43.130:/rhs/existing-files/brick1
Brick2: 10.70.43.26:/rhs/existing-files/brick1
Brick3: 10.70.43.103:/rhs/existing-files/brick1
Brick4: 10.70.43.183:/rhs/existing-files/brick1
 
Volume Name: busy-files
Type: Distributed-Replicate
Volume ID: 5385f5c7-7fc1-462e-8aad-610510666adc
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.43.130:/rhs/busy-files/brick1
Brick2: 10.70.43.26:/rhs/busy-files/brick1
Brick3: 10.70.43.103:/rhs/busy-files/brick1
Brick4: 10.70.43.183:/rhs/busy-files/brick1
Options Reconfigured:
storage.owner-uid: 107
storage.owner-gid: 107
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
 
Volume Name: new-files
Type: Distributed-Replicate
Volume ID: 84567149-7dee-4680-9b2f-230df247fbef
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.43.130:/rhs/new-files/brick1
Brick2: 10.70.43.26:/rhs/new-files/brick1
Brick3: 10.70.43.103:/rhs/new-files/brick1
Brick4: 10.70.43.183:/rhs/new-files/brick1
[root@ninja-rhs2 ~]# 


[2] Create files under the mount point MNT_PT in a layout of 10 directories at each level, 10 levels deep; in each directory, create 1000 sparse files of size 100K and delete some of them.

[3] from node3:
[2013-10-18 15:38:17.318016] I [afr-self-heal-common.c:2840:afr_log_self_heal_completion_status] 0-new-files-replicate-1:  metadata self heal  is successfully completed, foreground data self heal  is successfully completed,  from new-files-client-2 with 10 0  sizes - Pending matrix:  [ [ 0 3 ] [ 0 0 ] ] on <gfid:5d382a1c-c8bd-4927-93f6-73eb2b9bfd84>
[2013-10-18 15:38:17.353288] I [afr-self-heal-common.c:2840:afr_log_self_heal_completion_status] 0-new-files-replicate-1:  metadata self heal  is successfully completed, foreground data self heal  is successfully completed,  from new-files-client-2 with 10 0  sizes - Pending matrix:  [ [ 0 3 ] [ 0 0 ] ] on <gfid:a746d051-56f3-472a-bc2f-72fbdd0a6a30>
[2013-10-18 15:38:17.389373] I [afr-self-heal-common.c:2840:afr_log_self_heal_completion_status] 0-new-files-replicate-1:  metadata self heal  is successfully completed, foreground data self heal  is successfully completed,  from new-files-client-2 with 1 0  sizes - Pending matrix:  [ [ 0 3 ] [ 0 0 ] ] on <gfid:684b5efe-686f-4017-9fb8-2f9a9f73968f>


[4] from Node3:
[2013-10-18 10:38:59.698336] I [client-handshake.c:1486:client_setvolume_cbk] 0-existing-files-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2013-10-18 10:38:59.698689] I [client-handshake.c:450:client_set_lk_version_cbk] 0-busy-files-client-1: Server lk version = 1
[2013-10-18 10:38:59.698731] I [client-handshake.c:450:client_set_lk_version_cbk] 0-new-files-client-3: Server lk version = 1
[2013-10-18 10:38:59.698771] I [client-handshake.c:450:client_set_lk_version_cbk] 0-existing-files-client-3: Server lk version = 1
[2013-10-18 10:38:59.698801] I [client-handshake.c:450:client_set_lk_version_cbk] 0-existing-files-client-1: Server lk version = 1
[2013-10-18 10:38:59.698828] I [client-handshake.c:450:client_set_lk_version_cbk] 0-new-files-client-0: Server lk version = 1
[2013-10-18 10:38:59.698867] I [client-handshake.c:1460:client_setvolume_cbk] 0-busy-files-client-0: 'server-pkg-version' key not found, handshaked with older client
[2013-10-18 10:38:59.698885] I [client-handshake.c:1474:client_setvolume_cbk] 0-busy-files-client-0: Connected to 10.70.43.130:49152, attached to remote volume '/rhs/busy-files/brick1'.
[2013-10-18 10:38:59.698900] I [client-handshake.c:1486:client_setvolume_cbk] 0-busy-files-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2013-10-18 10:38:59.699838] I [client-handshake.c:450:client_set_lk_version_cbk] 0-busy-files-client-0: Server lk version = 1
[2013-10-18 10:38:59.699908] I [client-handshake.c:1460:client_setvolume_cbk] 0-busy-files-client-3: 'server-pkg-version' key not found, handshaked with older client
[2013-10-18 10:38:59.699931] I [client-handshake.c:1474:client_setvolume_cbk] 0-busy-files-client-3: Connected to 10.70.43.183:49152, attached to remote volume '/rhs/busy-files/brick1'.
[2013-10-18 10:38:59.699947] I [client-handshake.c:1486:client_setvolume_cbk] 0-busy-files-client-3: Server and Client lk-version numbers are not same, reopening the fds
[2013-10-18 10:38:59.700686] I [client-handshake.c:450:client_set_lk_version_cbk] 0-existing-files-client-0: Server lk version = 1
[2013-10-18 10:38:59.701578] I [client-handshake.c:450:client_set_lk_version_cbk] 0-busy-files-client-3: Server lk version = 1
[2013-10-18 10:38:59.704508] I [afr-self-heald.c:1180:afr_dir_exclusive_crawl] 0-new-files-replicate-1: Another crawl is in progress for new-files-client-2
[2013-10-18 10:38:59.704576] I [afr-self-heald.c:1180:afr_dir_exclusive_crawl] 0-busy-files-replicate-1: Another crawl is in progress for busy-files-client-2
[2013-10-18 10:38:59.710034] I [afr-self-heald.c:1180:afr_dir_exclusive_crawl] 0-existing-files-replicate-1: Another crawl is in progress for existing-files-client-2
[2013-10-18 11:05:44.215374] E [afr-self-heald.c:1243:afr_start_crawl] 0-busy-files-replicate-1: afr crawl failed for child 0 with  ret -1
[2013-10-18 11:05:45.980510] E [afr-self-heald.c:1243:afr_start_crawl] 0-busy-files-replicate-1: afr crawl failed for child 0 with  ret -1
[2013-10-18 11:05:54.474395] E [afr-self-heald.c:1243:afr_start_crawl] 0-busy-files-replicate-1: afr crawl failed for child 0 with  ret -1
[2013-10-18 11:05:55.795548] E [afr-self-heald.c:1243:afr_start_crawl] 0-busy-files-replicate-1: afr crawl failed for child 0 with  ret -1


[5] from node1:
[2013-10-18 13:18:05.761894] I [afr-self-heald.c:1180:afr_dir_exclusive_crawl] 0-busy-files-replicate-0: Another crawl is in progress for busy-files-client-1
[2013-10-18 13:53:21.579566] W [socket.c:522:__socket_rwv] 0-existing-files-client-3: readv on 10.70.43.183:49153 failed (No data available)
[2013-10-18 13:53:21.579721] I [client.c:2103:client_rpc_notify] 0-existing-files-client-3: disconnected from 10.70.43.183:49153. Client process will keep trying to connect to glusterd until brick's port is available.
[2013-10-18 13:53:21.579909] W [socket.c:522:__socket_rwv] 0-busy-files-client-3: readv on 10.70.43.183:49152 failed (No data available)
[2013-10-18 13:53:21.579991] I [client.c:2103:client_rpc_notify] 0-busy-files-client-3: disconnected from 10.70.43.183:49152. Client process will keep trying to connect to glusterd until brick's port is available.
[2013-10-18 13:53:21.590481] W [socket.c:522:__socket_rwv] 0-new-files-client-3: readv on 10.70.43.183:49154 failed (No data available)
[2013-10-18 13:53:21.590573] I [client.c:2103:client_rpc_notify] 0-new-files-client-3: disconnected from 10.70.43.183:49154. Client process will keep trying to connect to glusterd until brick's port is available.
[2013-10-18 13:53:32.095953] E [socket.c:2158:socket_connect_finish] 0-existing-files-client-3: connection to 10.70.43.183:24007 failed (Connection refused)
[2013-10-18 13:53:32.101389] E [socket.c:2158:socket_connect_finish] 0-busy-files-client-3: connection to 10.70.43.183:24007 failed (Connection refused)
[2013-10-18 13:53:32.106842] E [socket.c:2158:socket_connect_finish] 0-new-files-client-3: connection to 10.70.43.183:24007 failed (Connection refused)
[2013-10-18 14:15:59.661282] W [socket.c:522:__socket_rwv] 0-existing-files-client-3: readv on 10.70.43.183:24007 failed (Connection reset by peer)
[2013-10-18 14:15:59.662153] E [rpc-clnt.c:368:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x164) [0x37b7e0e0f4] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x37b7e0dc33] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x37b7e0db4e]))) 0-existing-files-client-3: forced unwinding frame type(GF-DUMP) op(DUMP(1)) called at 2013-10-18 14:15:52.719387 (xid=0x10x)
[2013-10-18 14:15:59.662184] W [client-handshake.c:1863:client_dump_version_cbk] 0-existing-files-client-3: received RPC status error
[2013-10-18 14:15:59.662228] I [client.c:2103:client_rpc_notify] 0-existing-files-client-3: disconnected from 10.70.43.183:24007. Client process will keep trying to connect to glusterd until brick's port is available.
[2013-10-18 14:15:59.662261] W [socket.c:522:__socket_rwv] 0-busy-files-client-3: readv on 10.70.43.183:24007 failed (Connection reset by peer)
[2013-10-18 14:15:59.662370] E [rpc-clnt.c:368:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x164) [0x37b7e0e0f4] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x37b7e0dc33] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x37b7e0db4e]))) 0-busy-files-client-3: forced unwinding frame type(GF-DUMP) op(DUMP(1)) called at 2013-10-18 14:15:52.725176 (xid=0x10x)
[2013-10-18 14:15:59.662389] W [client-handshake.c:1863:client_dump_version_cbk] 0-busy-files-client-3: received RPC status error
[2013-10-18 14:15:59.662414] I [client.c:2103:client_rpc_notify] 0-busy-files-client-3: disconnected from 10.70.43.183:24007. Client process will keep trying to connect to glusterd until brick's port is available.
[2013-10-18 14:15:59.662439] W [socket.c:522:__socket_rwv] 0-new-files-client-3: readv on 10.70.43.183:24007 failed (Connection reset by peer)
[2013-10-18 14:15:59.662498] E [rpc-clnt.c:368:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x164) [0x37b7e0e0f4] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x37b7e0dc33] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x37b7e0db4e]))) 0-new-files-client-3: forced unwinding frame type(GF-DUMP) op(DUMP(1)) called at 2013-10-18 14:15:52.730450 (xid=0x10x)
[2013-10-18 14:15:59.662547] W [client-handshake.c:1863:client_dump_version_cbk] 0-new-files-client-3: received RPC status error
[2013-10-18 14:15:59.662572] I [client.c:2103:client_rpc_notify] 0-new-files-client-3: disconnected from 10.70.43.183:24007. Client process will keep trying to connect to glusterd until brick's port is available.
[2013-10-18 14:16:01.740610] E [socket.c:2158:socket_connect_finish] 0-existing-files-client-3: connection to 10.70.43.183:24007 failed (Connection refused)

--- Additional comment from Gowrishankar Rajaiyan on 2013-10-18 15:35:25 EDT ---



--- Additional comment from Pranith Kumar K on 2013-10-22 02:23:00 EDT ---

After a debugging session with Gowrishankar, we came to the conclusion that the issue occurs because the indices directory is not present on some of the bricks. Once at least one file is created on the brick, the log messages stop. The fix is to create the indices directory as soon as the brick comes up.

--- Additional comment from Gowrishankar Rajaiyan on 2013-10-22 02:30:40 EDT ---

This issue can be hit in a pre-production environment and should be mentioned as a prerequisite for upgrade: ensure that all bricks have data (in other words, that the indices directory has been created) before performing any rolling upgrade.

--- Additional comment from Pavithra on 2013-10-23 07:12:58 EDT ---

Hi KP,

Can you please review the edited doc text for technical accuracy?

--- Additional comment from Vivek Agarwal on 2013-11-14 06:27:54 EST ---

Moving the known issues to Doc team, to be documented in release notes for U1

--- Additional comment from Pavithra on 2013-11-25 02:16:57 EST ---

I've documented this as a known issue in the Big Bend Update 1 Release Notes. Here is the link:

http://documentation-devel.engineering.redhat.com/docs/en-US/Red_Hat_Storage/2.1/html/2.1_Update_1_Release_Notes/chap-Documentation-2.1_Update_1_Release_Notes-Known_Issues.html

Comment 1 Anuradha 2013-11-25 11:41:07 UTC
Posted for review on :
http://review.gluster.org/#/c/6343/

Comment 2 Anand Avati 2013-11-26 06:18:08 UTC
REVIEW: http://review.gluster.org/6343 (features/index : Creation of indices directory as soon as brick is up.) posted (#2) for review on master by Anuradha Talur (atalur)

Comment 3 Anand Avati 2013-11-27 09:32:01 UTC
REVIEW: http://review.gluster.org/6343 (features/index : Creation of indices directory as soon as brick is up.) posted (#3) for review on master by Anuradha Talur (atalur)

Comment 4 Anand Avati 2013-11-28 16:47:34 UTC
COMMIT: http://review.gluster.org/6343 committed in master by Vijay Bellur (vbellur) 
------
commit 3c5529332779ad6cd169370193978ac24974320d
Author: Anuradha <atalur>
Date:   Mon Nov 25 14:48:23 2013 +0530

    features/index : Creation of indices directory as soon as brick is up.
    
    Missing indices directory in the bricks leads to unwanted log messages.
    Therefore, indices directory needs to be created as soon as the brick
    comes up.
    This patch results in creation of indices/xattrop directory as required.
    
    Also includes a testcase to test the same.
    
    Change-Id: Ic2aedce00c6c1fb24929ca41f6085aed28709d2c
    BUG: 1034085
    Signed-off-by: Anuradha <atalur>
    Reviewed-on: http://review.gluster.org/6343
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    Reviewed-by: Vijay Bellur <vbellur>

Comment 5 Anand Avati 2013-12-04 16:16:49 UTC
REVIEW: http://review.gluster.org/6426 (index: fix the order of initialization) posted (#1) for review on master by Anand Avati (avati)

Comment 6 Anand Avati 2013-12-04 22:31:54 UTC
COMMIT: http://review.gluster.org/6426 committed in master by Anand Avati (avati) 
------
commit 64e3fada97bdb6dc4b885d6122e56eb93273503c
Author: Anand Avati <avati>
Date:   Wed Nov 27 12:54:41 2013 -0800

    index: fix the order of initialization
    
    thread spawning must be the last action. if not, failures after
    thread creation can set this->private to NULL and cause index_worker
    to segfault
    
    Change-Id: I71c85c25d7d6d1ed5f8d9c951db0262b8a3f1d90
    BUG: 1034085
    Signed-off-by: Anand Avati <avati>
    Reviewed-on: http://review.gluster.org/6426
    Reviewed-by: Kaleb KEITHLEY <kkeithle>
    Tested-by: Gluster Build System <jenkins.com>

Comment 7 Niels de Vos 2014-09-22 12:32:56 UTC
A beta release for GlusterFS 3.6.0 has been released. Please verify if the release solves this bug report for you. In case the glusterfs-3.6.0beta1 release does not have a resolution for this issue, leave a comment in this bug and move the status to ASSIGNED. If this release fixes the problem for you, leave a note and change the status to VERIFIED.

Packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update (possibly an "updates-testing" repository) infrastructure for your distribution.

[1] http://supercolony.gluster.org/pipermail/gluster-users/2014-September/018836.html
[2] http://supercolony.gluster.org/pipermail/gluster-users/

Comment 8 Niels de Vos 2014-11-11 08:24:48 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.6.1, please reopen this bug report.

glusterfs-3.6.1 has been announced [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://supercolony.gluster.org/pipermail/gluster-users/2014-November/019410.html
[2] http://supercolony.gluster.org/mailman/listinfo/gluster-users

