+++ This bug was initially created as a clone of Bug #1020995 +++

Description of problem:
Testing the upgrade scenario as specified in bug 1009151#c3. I set up three 2x2 volumes [1] and ran FOPs [2] from multiple clients. I took one node down by stopping glusterd and killing the glusterfs processes (pkill glusterfs). Post upgrade, when I started glusterd, self-heal triggered as expected; note that I have only ~35G of data across all 3 volumes. I saw self-heal success messages [3] for an hour or so as I continued to run FOPs [2] from the clients, and then self-heal error messages [4]. Only "vol heal info", "vol status", and "vol info" commands were executed during this entire activity.

Version-Release number of selected component (if applicable):
3 nodes: glusterfs-server-3.4.0.33rhs-1.el6rhs.x86_64
1 node: upgraded to glusterfs-server-3.4.0.35rhs-1.el6rhs.x86_64

How reproducible:
Hit this on the first attempt while testing a rolling upgrade.

Steps to Reproduce:
1. Create three 2x2 volumes.
2. Mount them on multiple clients.
3. Run FOPs [2] and ensure there is around 25-30G of data.
4. Bring down node4 (stop glusterd, pkill glusterfs processes) while continuing to run FOPs.
5. Perform the upgrade (yum update).
6. Reboot node4.
7. service glusterd start (since I chkconfig'ed it off before the reboot).
8. Observe the log messages in /var/log/glusterfs/glustershd.log.

Actual results:
Initially self-heal succeeds; after an hour or so it starts erroring.

Expected results:
Self-heal should keep running successfully.
Additional info:

[1]
[root@ninja-rhs2 ~]# gluster vol info

Volume Name: existing-files
Type: Distributed-Replicate
Volume ID: d15fd04b-5664-4eb4-8ad1-1dfb986131c3
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.43.130:/rhs/existing-files/brick1
Brick2: 10.70.43.26:/rhs/existing-files/brick1
Brick3: 10.70.43.103:/rhs/existing-files/brick1
Brick4: 10.70.43.183:/rhs/existing-files/brick1

Volume Name: busy-files
Type: Distributed-Replicate
Volume ID: 5385f5c7-7fc1-462e-8aad-610510666adc
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.43.130:/rhs/busy-files/brick1
Brick2: 10.70.43.26:/rhs/busy-files/brick1
Brick3: 10.70.43.103:/rhs/busy-files/brick1
Brick4: 10.70.43.183:/rhs/busy-files/brick1
Options Reconfigured:
storage.owner-uid: 107
storage.owner-gid: 107
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable

Volume Name: new-files
Type: Distributed-Replicate
Volume ID: 84567149-7dee-4680-9b2f-230df247fbef
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.43.130:/rhs/new-files/brick1
Brick2: 10.70.43.26:/rhs/new-files/brick1
Brick3: 10.70.43.103:/rhs/new-files/brick1
Brick4: 10.70.43.183:/rhs/new-files/brick1
[root@ninja-rhs2 ~]#

[2] Create files under the mount point MNT_PT in a layout of 10 directories at each level, 10 levels deep; in each directory, create 1000 sparse files of size 100K and delete some of them.
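The workload in [2] can be sketched as a small shell function. This is a hypothetical reconstruction, not the script actually used in the test: the function name `make_tree` and its parameters are assumptions, and at the described scale (depth 10, width 10, 1000 files per directory) the tree is enormous, so scale the parameters down for a sanity check.

```shell
#!/bin/bash
# Sketch of the FOP workload in [2]: WIDTH directories per level, DEPTH levels
# deep, NFILES sparse 100K files in each directory, then delete some of them.
# All names and parameters are illustrative.
make_tree() {
    local base=$1 depth=$2 width=$3 nfiles=$4
    local i j
    for ((i = 1; i <= nfiles; i++)); do
        # Writing one byte at offset 102399 yields a 100K sparse file.
        dd if=/dev/zero of="$base/file$i" bs=1 count=1 seek=102399 2>/dev/null
    done
    rm -f "$base/file1"                  # "and delete some", per [2]
    if ((depth > 0)); then
        for ((j = 1; j <= width; j++)); do
            mkdir -p "$base/dir$j"
            make_tree "$base/dir$j" $((depth - 1)) "$width" "$nfiles"
        done
    fi
}
```

For example, `make_tree /mnt/new-files 10 10 1000` would reproduce the described layout, but note that at full scale this creates on the order of 10^10 directories; something like `make_tree /tmp/t 2 3 5` is enough to see the shape.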
[3] from node3:

[2013-10-18 15:38:17.318016] I [afr-self-heal-common.c:2840:afr_log_self_heal_completion_status] 0-new-files-replicate-1: metadata self heal is successfully completed, foreground data self heal is successfully completed, from new-files-client-2 with 10 0 sizes - Pending matrix: [ [ 0 3 ] [ 0 0 ] ] on <gfid:5d382a1c-c8bd-4927-93f6-73eb2b9bfd84>
[2013-10-18 15:38:17.353288] I [afr-self-heal-common.c:2840:afr_log_self_heal_completion_status] 0-new-files-replicate-1: metadata self heal is successfully completed, foreground data self heal is successfully completed, from new-files-client-2 with 10 0 sizes - Pending matrix: [ [ 0 3 ] [ 0 0 ] ] on <gfid:a746d051-56f3-472a-bc2f-72fbdd0a6a30>
[2013-10-18 15:38:17.389373] I [afr-self-heal-common.c:2840:afr_log_self_heal_completion_status] 0-new-files-replicate-1: metadata self heal is successfully completed, foreground data self heal is successfully completed, from new-files-client-2 with 1 0 sizes - Pending matrix: [ [ 0 3 ] [ 0 0 ] ] on <gfid:684b5efe-686f-4017-9fb8-2f9a9f73968f>

[4] from node3:

[2013-10-18 10:38:59.698336] I [client-handshake.c:1486:client_setvolume_cbk] 0-existing-files-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2013-10-18 10:38:59.698689] I [client-handshake.c:450:client_set_lk_version_cbk] 0-busy-files-client-1: Server lk version = 1
[2013-10-18 10:38:59.698731] I [client-handshake.c:450:client_set_lk_version_cbk] 0-new-files-client-3: Server lk version = 1
[2013-10-18 10:38:59.698771] I [client-handshake.c:450:client_set_lk_version_cbk] 0-existing-files-client-3: Server lk version = 1
[2013-10-18 10:38:59.698801] I [client-handshake.c:450:client_set_lk_version_cbk] 0-existing-files-client-1: Server lk version = 1
[2013-10-18 10:38:59.698828] I [client-handshake.c:450:client_set_lk_version_cbk] 0-new-files-client-0: Server lk version = 1
[2013-10-18 10:38:59.698867] I [client-handshake.c:1460:client_setvolume_cbk] 0-busy-files-client-0: 'server-pkg-version' key not found, handshaked with older client
[2013-10-18 10:38:59.698885] I [client-handshake.c:1474:client_setvolume_cbk] 0-busy-files-client-0: Connected to 10.70.43.130:49152, attached to remote volume '/rhs/busy-files/brick1'.
[2013-10-18 10:38:59.698900] I [client-handshake.c:1486:client_setvolume_cbk] 0-busy-files-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2013-10-18 10:38:59.699838] I [client-handshake.c:450:client_set_lk_version_cbk] 0-busy-files-client-0: Server lk version = 1
[2013-10-18 10:38:59.699908] I [client-handshake.c:1460:client_setvolume_cbk] 0-busy-files-client-3: 'server-pkg-version' key not found, handshaked with older client
[2013-10-18 10:38:59.699931] I [client-handshake.c:1474:client_setvolume_cbk] 0-busy-files-client-3: Connected to 10.70.43.183:49152, attached to remote volume '/rhs/busy-files/brick1'.
[2013-10-18 10:38:59.699947] I [client-handshake.c:1486:client_setvolume_cbk] 0-busy-files-client-3: Server and Client lk-version numbers are not same, reopening the fds
[2013-10-18 10:38:59.700686] I [client-handshake.c:450:client_set_lk_version_cbk] 0-existing-files-client-0: Server lk version = 1
[2013-10-18 10:38:59.701578] I [client-handshake.c:450:client_set_lk_version_cbk] 0-busy-files-client-3: Server lk version = 1
[2013-10-18 10:38:59.704508] I [afr-self-heald.c:1180:afr_dir_exclusive_crawl] 0-new-files-replicate-1: Another crawl is in progress for new-files-client-2
[2013-10-18 10:38:59.704576] I [afr-self-heald.c:1180:afr_dir_exclusive_crawl] 0-busy-files-replicate-1: Another crawl is in progress for busy-files-client-2
[2013-10-18 10:38:59.710034] I [afr-self-heald.c:1180:afr_dir_exclusive_crawl] 0-existing-files-replicate-1: Another crawl is in progress for existing-files-client-2
[2013-10-18 11:05:44.215374] E [afr-self-heald.c:1243:afr_start_crawl] 0-busy-files-replicate-1: afr crawl failed for child 0 with ret -1
[2013-10-18 11:05:45.980510] E [afr-self-heald.c:1243:afr_start_crawl] 0-busy-files-replicate-1: afr crawl failed for child 0 with ret -1
[2013-10-18 11:05:54.474395] E [afr-self-heald.c:1243:afr_start_crawl] 0-busy-files-replicate-1: afr crawl failed for child 0 with ret -1
[2013-10-18 11:05:55.795548] E [afr-self-heald.c:1243:afr_start_crawl] 0-busy-files-replicate-1: afr crawl failed for child 0 with ret -1

[5] from node1:

[2013-10-18 13:18:05.761894] I [afr-self-heald.c:1180:afr_dir_exclusive_crawl] 0-busy-files-replicate-0: Another crawl is in progress for busy-files-client-1
[2013-10-18 13:53:21.579566] W [socket.c:522:__socket_rwv] 0-existing-files-client-3: readv on 10.70.43.183:49153 failed (No data available)
[2013-10-18 13:53:21.579721] I [client.c:2103:client_rpc_notify] 0-existing-files-client-3: disconnected from 10.70.43.183:49153. Client process will keep trying to connect to glusterd until brick's port is available.
[2013-10-18 13:53:21.579909] W [socket.c:522:__socket_rwv] 0-busy-files-client-3: readv on 10.70.43.183:49152 failed (No data available)
[2013-10-18 13:53:21.579991] I [client.c:2103:client_rpc_notify] 0-busy-files-client-3: disconnected from 10.70.43.183:49152. Client process will keep trying to connect to glusterd until brick's port is available.
[2013-10-18 13:53:21.590481] W [socket.c:522:__socket_rwv] 0-new-files-client-3: readv on 10.70.43.183:49154 failed (No data available)
[2013-10-18 13:53:21.590573] I [client.c:2103:client_rpc_notify] 0-new-files-client-3: disconnected from 10.70.43.183:49154. Client process will keep trying to connect to glusterd until brick's port is available.
[2013-10-18 13:53:32.095953] E [socket.c:2158:socket_connect_finish] 0-existing-files-client-3: connection to 10.70.43.183:24007 failed (Connection refused)
[2013-10-18 13:53:32.101389] E [socket.c:2158:socket_connect_finish] 0-busy-files-client-3: connection to 10.70.43.183:24007 failed (Connection refused)
[2013-10-18 13:53:32.106842] E [socket.c:2158:socket_connect_finish] 0-new-files-client-3: connection to 10.70.43.183:24007 failed (Connection refused)
[2013-10-18 14:15:59.661282] W [socket.c:522:__socket_rwv] 0-existing-files-client-3: readv on 10.70.43.183:24007 failed (Connection reset by peer)
[2013-10-18 14:15:59.662153] E [rpc-clnt.c:368:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x164) [0x37b7e0e0f4] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x37b7e0dc33] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x37b7e0db4e]))) 0-existing-files-client-3: forced unwinding frame type(GF-DUMP) op(DUMP(1)) called at 2013-10-18 14:15:52.719387 (xid=0x10x)
[2013-10-18 14:15:59.662184] W [client-handshake.c:1863:client_dump_version_cbk] 0-existing-files-client-3: received RPC status error
[2013-10-18 14:15:59.662228] I [client.c:2103:client_rpc_notify] 0-existing-files-client-3: disconnected from 10.70.43.183:24007. Client process will keep trying to connect to glusterd until brick's port is available.
[2013-10-18 14:15:59.662261] W [socket.c:522:__socket_rwv] 0-busy-files-client-3: readv on 10.70.43.183:24007 failed (Connection reset by peer)
[2013-10-18 14:15:59.662370] E [rpc-clnt.c:368:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x164) [0x37b7e0e0f4] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x37b7e0dc33] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x37b7e0db4e]))) 0-busy-files-client-3: forced unwinding frame type(GF-DUMP) op(DUMP(1)) called at 2013-10-18 14:15:52.725176 (xid=0x10x)
[2013-10-18 14:15:59.662389] W [client-handshake.c:1863:client_dump_version_cbk] 0-busy-files-client-3: received RPC status error
[2013-10-18 14:15:59.662414] I [client.c:2103:client_rpc_notify] 0-busy-files-client-3: disconnected from 10.70.43.183:24007. Client process will keep trying to connect to glusterd until brick's port is available.
[2013-10-18 14:15:59.662439] W [socket.c:522:__socket_rwv] 0-new-files-client-3: readv on 10.70.43.183:24007 failed (Connection reset by peer)
[2013-10-18 14:15:59.662498] E [rpc-clnt.c:368:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x164) [0x37b7e0e0f4] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x37b7e0dc33] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x37b7e0db4e]))) 0-new-files-client-3: forced unwinding frame type(GF-DUMP) op(DUMP(1)) called at 2013-10-18 14:15:52.730450 (xid=0x10x)
[2013-10-18 14:15:59.662547] W [client-handshake.c:1863:client_dump_version_cbk] 0-new-files-client-3: received RPC status error
[2013-10-18 14:15:59.662572] I [client.c:2103:client_rpc_notify] 0-new-files-client-3: disconnected from 10.70.43.183:24007. Client process will keep trying to connect to glusterd until brick's port is available.
[2013-10-18 14:16:01.740610] E [socket.c:2158:socket_connect_finish] 0-existing-files-client-3: connection to 10.70.43.183:24007 failed (Connection refused)

--- Additional comment from Gowrishankar Rajaiyan on 2013-10-18 15:35:25 EDT ---

--- Additional comment from Gowrishankar Rajaiyan on 2013-10-18 15:37:54 EDT ---

--- Additional comment from Gowrishankar Rajaiyan on 2013-10-18 15:40:02 EDT ---

--- Additional comment from Gowrishankar Rajaiyan on 2013-10-18 15:41:42 EDT ---

--- Additional comment from Pranith Kumar K on 2013-10-22 02:23:00 EDT ---

After a debugging session with Gowrishankar, we came to the conclusion that the issue happens because the indices directory is not present on some of the bricks. Once at least one file is created on the brick, the log messages stop. For a fix, the indices directory needs to be created as soon as the brick comes up.

--- Additional comment from Gowrishankar Rajaiyan on 2013-10-22 02:30:40 EDT ---

This issue can be hit in a pre-production environment. It should be mentioned as a prerequisite for upgrade: ensure that all bricks have data (in other words, that the indices directory has been created) before performing any rolling upgrade.

--- Additional comment from Pavithra on 2013-10-23 07:12:58 EDT ---

Hi KP, can you please review the edited doc text for technical accuracy?

--- Additional comment from Vivek Agarwal on 2013-11-14 06:27:54 EST ---

Moving the known issues to the Doc team, to be documented in the release notes for U1.

--- Additional comment from Vivek Agarwal on 2013-11-14 06:29:42 EST ---

--- Additional comment from Vivek Agarwal on 2013-11-14 06:30:15 EST ---

--- Additional comment from Pavithra on 2013-11-25 02:16:57 EST ---

I've documented this as a known issue in the Big Bend Update 1 Release Notes.
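The prerequisite above can be checked with a short script run on each node before the rolling upgrade. This is a sketch, not an official tool: the function name is made up, the brick paths are examples taken from the volumes in [1], and it assumes the self-heal indices live at `.glusterfs/indices/xattrop` inside each brick directory, which is the directory the fix in this bug creates automatically.

```shell
#!/bin/bash
# Sketch: before a rolling upgrade, verify that every local brick already has
# the self-heal indices directory (.glusterfs/indices/xattrop). Bricks listed
# on the command line; returns nonzero if any brick is missing the directory.
check_indices() {
    local missing=0 brick
    for brick in "$@"; do
        if [ -d "$brick/.glusterfs/indices/xattrop" ]; then
            echo "OK:      $brick"
        else
            echo "MISSING: $brick"
            missing=1
        fi
    done
    return "$missing"
}

# Example invocation with bricks from the volumes in [1]:
# check_indices /rhs/existing-files/brick1 /rhs/busy-files/brick1 /rhs/new-files/brick1
```

A brick reported as MISSING has never had a file created on it, which is exactly the state that triggers the crawl failures above.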
Here is the link: http://documentation-devel.engineering.redhat.com/docs/en-US/Red_Hat_Storage/2.1/html/2.1_Update_1_Release_Notes/chap-Documentation-2.1_Update_1_Release_Notes-Known_Issues.html
Posted for review on: http://review.gluster.org/#/c/6343/
REVIEW: http://review.gluster.org/6343 (features/index : Creation of indices directory as soon as brick is up.) posted (#2) for review on master by Anuradha Talur (atalur)
REVIEW: http://review.gluster.org/6343 (features/index : Creation of indices directory as soon as brick is up.) posted (#3) for review on master by Anuradha Talur (atalur)
COMMIT: http://review.gluster.org/6343 committed in master by Vijay Bellur (vbellur)
------
commit 3c5529332779ad6cd169370193978ac24974320d
Author: Anuradha <atalur>
Date:   Mon Nov 25 14:48:23 2013 +0530

    features/index : Creation of indices directory as soon as brick is up.

    Missing indices directory in the bricks leads to unwanted log messages.
    Therefore, indices directory needs to be created as soon as the brick
    comes up. This patch results in creation of indices/xattrop directory
    as required. Also includes a testcase to test the same.

    Change-Id: Ic2aedce00c6c1fb24929ca41f6085aed28709d2c
    BUG: 1034085
    Signed-off-by: Anuradha <atalur>
    Reviewed-on: http://review.gluster.org/6343
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    Reviewed-by: Vijay Bellur <vbellur>
REVIEW: http://review.gluster.org/6426 (index: fix the order of initialization) posted (#1) for review on master by Anand Avati (avati)
COMMIT: http://review.gluster.org/6426 committed in master by Anand Avati (avati)
------
commit 64e3fada97bdb6dc4b885d6122e56eb93273503c
Author: Anand Avati <avati>
Date:   Wed Nov 27 12:54:41 2013 -0800

    index: fix the order of initialization

    thread spawning must be the last action. if not, failures after thread
    creation can set this->private to NULL and cause index_worker to
    segfault

    Change-Id: I71c85c25d7d6d1ed5f8d9c951db0262b8a3f1d90
    BUG: 1034085
    Signed-off-by: Anand Avati <avati>
    Reviewed-on: http://review.gluster.org/6426
    Reviewed-by: Kaleb KEITHLEY <kkeithle>
    Tested-by: Gluster Build System <jenkins.com>
A beta release for GlusterFS 3.6.0 has been made available [1]. Please verify whether this release resolves the issue in this bug report for you. If the glusterfs-3.6.0beta1 release does not resolve this issue, leave a comment and move the status to ASSIGNED. If this release fixes the problem for you, leave a note and change the status to VERIFIED. Packages for several distributions should become available in the near future; keep an eye on the Gluster Users mailing list [2] and on the update infrastructure (possibly an "updates-testing" repository) for your distribution.

[1] http://supercolony.gluster.org/pipermail/gluster-users/2014-September/018836.html
[2] http://supercolony.gluster.org/pipermail/gluster-users/
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.6.1, please reopen this bug report. glusterfs-3.6.1 has been announced [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and on the update infrastructure for your distribution.

[1] http://supercolony.gluster.org/pipermail/gluster-users/2014-November/019410.html
[2] http://supercolony.gluster.org/mailman/listinfo/gluster-users