Description of problem:
=======================
In a 1 x 2 replicate volume, while ping_pong was running on a file, one of the brick processes went offline. All the mount processes were killed and the mount points unmounted. When the brick was brought back online, self-heal happened from sink to source, i.e. from the brick which had been offline and came back online to the brick which had stayed online throughout.

glustershd log:
===============
[2013-09-02 11:43:09.438655] I [afr-self-heal-common.c:2827:afr_log_self_heal_completion_status] 0-vol_dis_1_rep_2-replicate-0: foreground data self heal is successfully completed, from vol_dis_1_rep_2-client-1 with 7 7 sizes - Pending matrix: [ [ 2 2131 ] [ 1 1 ] ] on <gfid:4e060386-0f71-430b-b511-bc1d7bba9ca4>

Version-Release number of selected component (if applicable):
=============================================================
glusterfs 3.4.0.30rhs built on Aug 30 2013 08:15:37

How reproducible:
=================
Intermittent. Tried to recreate the issue 3 times; seen it once so far.

Steps to Reproduce:
===================
1. Create a replicate volume (1 x 2) and start it.

root@king [Sep-02-2013-12:31:44] >gluster v info

Volume Name: vol_dis_1_rep_2
Type: Replicate
Volume ID: 15a1734e-8485-4ef2-a82b-ddafff2fc97e
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: hicks.lab.eng.blr.redhat.com:/rhs/bricks/vol_dis_1_rep_2_b0
Brick2: king.lab.eng.blr.redhat.com:/rhs/bricks/vol_dis_1_rep_2_b1
Options Reconfigured:
performance.write-behind: on
cluster.self-heal-daemon: on

2. Create 4 fuse mounts.
3. From all the mount points, start ping_pong: "ping_pong -rw ping_pong_testfile 6"
4. While ping_pong is in progress, kill a brick (brick1).
5. After some time, kill all the mount processes and unmount the mount points.
6. Bring the brick back online.

Actual results:
===============
Self-heal of files happened in the reverse direction, i.e. from sink to source.
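For context on why this direction is wrong, the expected source can be read off the pending matrix in the glustershd log above. The sketch below is a deliberately simplified model of AFR's source/sink selection, not the actual afr-self-heal code: pending[i][j] is brick i's count of operations still pending on brick j, and the brick the *others* accuse least is the expected source.

```python
# Simplified sketch (not the GlusterFS implementation) of picking a heal
# source from an AFR pending matrix. pending[i][j] = operations brick i
# has recorded as still pending (unacknowledged) on brick j.

def choose_source(pending):
    """Pick the brick index the heal should copy FROM.

    A brick accumulates pending counts against a replica that missed
    writes, so the brick the other replicas accuse the least is the
    best source; the rest are sinks. The diagonal (self-accusations)
    is ignored: it only reflects transactions in flight when a brick
    died, not stale data.
    """
    n = len(pending)
    accusations = [sum(pending[i][j] for i in range(n) if i != j)
                   for j in range(n)]
    return min(range(n), key=lambda j: accusations[j])

# Pending matrix from the glustershd log above:
matrix = [[2, 2131],
          [1, 1]]
print(choose_source(matrix))  # 0 -> brick0 (client-0) should be the source
```

Under this reading brick0 holds 2131 pending data operations against brick1 and should be the source, yet the log reports the heal completed "from vol_dis_1_rep_2-client-1", i.e. in the opposite direction.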
In our case, self-heal happened from brick1 (which needed self-heal) to brick0 (the always-online brick).

Expected results:
=================
Self-heal should happen from source to sink.

Additional info:
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
brick0 commands execution history:
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
root@hicks [Sep-02-2013-11:39:45] >md5sum /rhs/bricks/vol_dis_1_rep_2_b0/testfile
5c59e2563fd88fba468e0b6d6c51abfb  /rhs/bricks/vol_dis_1_rep_2_b0/testfile

root@hicks [Sep-02-2013-11:41:09] >ls -l /rhs/bricks/vol_dis_1_rep_2_b0/.glusterfs/indices/xattrop/
total 0
---------- 2 root root 0 Sep 2 10:59 4e060386-0f71-430b-b511-bc1d7bba9ca4
---------- 2 root root 0 Sep 2 10:59 xattrop-47343724-cd89-4a12-bb16-846a4c59f6ea

root@hicks [Sep-02-2013-11:41:16] >
root@hicks [Sep-02-2013-11:41:30] >getfattr -d -e hex -m . /rhs/bricks/vol_dis_1_rep_2_b0/testfile
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/vol_dis_1_rep_2_b0/testfile
trusted.afr.vol_dis_1_rep_2-client-0=0x000000020000000000000000
trusted.afr.vol_dis_1_rep_2-client-1=0x000008530000000000000000
trusted.gfid=0x4e0603860f71430bb511bc1d7bba9ca4

root@hicks [Sep-02-2013-11:41:35] >md5sum /rhs/bricks/vol_dis_1_rep_2_b0/testfile
5c59e2563fd88fba468e0b6d6c51abfb  /rhs/bricks/vol_dis_1_rep_2_b0/testfile

root@hicks [Sep-02-2013-11:42:52] >getfattr -d -e hex -m . /rhs/bricks/vol_dis_1_rep_2_b0/testfile
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/vol_dis_1_rep_2_b0/testfile
trusted.afr.vol_dis_1_rep_2-client-0=0x000000020000000000000000
trusted.afr.vol_dis_1_rep_2-client-1=0x000008530000000000000000
trusted.gfid=0x4e0603860f71430bb511bc1d7bba9ca4

root@hicks [Sep-02-2013-11:43:12] >
root@hicks [Sep-02-2013-11:43:12] >getfattr -d -e hex -m .
/rhs/bricks/vol_dis_1_rep_2_b0/testfile
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/vol_dis_1_rep_2_b0/testfile
trusted.afr.vol_dis_1_rep_2-client-0=0x000000000000000000000000
trusted.afr.vol_dis_1_rep_2-client-1=0x000000000000000000000000
trusted.gfid=0x4e0603860f71430bb511bc1d7bba9ca4

root@hicks [Sep-02-2013-11:43:13] >md5sum /rhs/bricks/vol_dis_1_rep_2_b0/testfile
4d1c15c19e60949c3c584a9f051afd1d  /rhs/bricks/vol_dis_1_rep_2_b0/testfile

root@hicks [Sep-02-2013-12:18:36] >gluster v status
Status of volume: vol_dis_1_rep_2
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick hicks.lab.eng.blr.redhat.com:/rhs/bricks/vol_dis_
1_rep_2_b0                                              49152   Y       2008
Brick king.lab.eng.blr.redhat.com:/rhs/bricks/vol_dis_1
_rep_2_b1                                               49152   Y       4295
NFS Server on localhost                                 2049    Y       5607
Self-heal Daemon on localhost                           N/A     Y       5613
NFS Server on 10.70.34.119                              2049    Y       6111
Self-heal Daemon on 10.70.34.119                        N/A     Y       6113

There are no active volume tasks

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
brick1 commands execution history:
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
root@king [Sep-02-2013-11:39:44] >md5sum /rhs/bricks/vol_dis_1_rep_2_b1/testfile
4d1c15c19e60949c3c584a9f051afd1d  /rhs/bricks/vol_dis_1_rep_2_b1/testfile

root@king [Sep-02-2013-11:39:47] >
root@king [Sep-02-2013-11:41:09] >ls -l /rhs/bricks/vol_dis_1_rep_2_b1/.glusterfs/indices/xattrop/
total 0
---------- 2 root root 0 Sep 2 10:51 4e060386-0f71-430b-b511-bc1d7bba9ca4
---------- 2 root root 0 Sep 2 10:51 xattrop-ee1b060f-0324-459d-9b40-42f7fc42140c

root@king [Sep-02-2013-11:41:30] >getfattr -d -e hex -m .
/rhs/bricks/vol_dis_1_rep_2_b1/testfile
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/vol_dis_1_rep_2_b1/testfile
trusted.afr.vol_dis_1_rep_2-client-0=0x000000010000000000000000
trusted.afr.vol_dis_1_rep_2-client-1=0x000000010000000000000000
trusted.gfid=0x4e0603860f71430bb511bc1d7bba9ca4

root@king [Sep-02-2013-11:41:35] >md5sum /rhs/bricks/vol_dis_1_rep_2_b1/testfile
4d1c15c19e60949c3c584a9f051afd1d  /rhs/bricks/vol_dis_1_rep_2_b1/testfile

root@king [Sep-02-2013-11:41:48] >
root@king [Sep-02-2013-11:41:49] >gluster v status
Status of volume: vol_dis_1_rep_2
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick hicks.lab.eng.blr.redhat.com:/rhs/bricks/vol_dis_
1_rep_2_b0                                              49152   Y       2008
Brick king.lab.eng.blr.redhat.com:/rhs/bricks/vol_dis_1
_rep_2_b1                                               N/A     N       2330
NFS Server on localhost                                 2049    Y       6111
Self-heal Daemon on localhost                           N/A     Y       6113
NFS Server on hicks.lab.eng.blr.redhat.com              2049    Y       5607
Self-heal Daemon on hicks.lab.eng.blr.redhat.com        N/A     Y       5613

There are no active volume tasks

root@king [Sep-02-2013-11:42:11] >
root@king [Sep-02-2013-11:42:12] >
root@king [Sep-02-2013-11:42:50] >
root@king [Sep-02-2013-11:42:50] >
root@king [Sep-02-2013-11:42:51] >
root@king [Sep-02-2013-11:42:52] >getfattr -d -e hex -m .
/rhs/bricks/vol_dis_1_rep_2_b1/testfile
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/vol_dis_1_rep_2_b1/testfile
trusted.afr.vol_dis_1_rep_2-client-0=0x000000010000000000000000
trusted.afr.vol_dis_1_rep_2-client-1=0x000000010000000000000000
trusted.gfid=0x4e0603860f71430bb511bc1d7bba9ca4

root@king [Sep-02-2013-11:42:54] >/usr/sbin/glusterfsd -s king.lab.eng.blr.redhat.com --volfile-id vol_dis_1_rep_2.king.lab.eng.blr.redhat.com.rhs-bricks-vol_dis_1_rep_2_b1 -p /var/lib/glusterd/vols/vol_dis_1_rep_2/run/king.lab.eng.blr.redhat.com-rhs-bricks-vol_dis_1_rep_2_b1.pid -S /var/run/d74ff90654cfd5541a156beb5002b9bd.socket --brick-name /rhs/bricks/vol_dis_1_rep_2_b1 -l /var/log/glusterfs/bricks/rhs-bricks-vol_dis_1_rep_2_b1.log --xlator-option *-posix.glusterd-uuid=ecb4cd8f-4c25-48e3-86a1-af4dd2cf917b --brick-port 49152 --xlator-option vol_dis_1_rep_2-server.listen-port=49152

root@king [Sep-02-2013-11:43:08] >gluster v status
Status of volume: vol_dis_1_rep_2
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick hicks.lab.eng.blr.redhat.com:/rhs/bricks/vol_dis_
1_rep_2_b0                                              49152   Y       2008
Brick king.lab.eng.blr.redhat.com:/rhs/bricks/vol_dis_1
_rep_2_b1                                               49152   Y       3613
NFS Server on localhost                                 2049    Y       6111
Self-heal Daemon on localhost                           N/A     Y       6113
NFS Server on hicks.lab.eng.blr.redhat.com              2049    Y       5607
Self-heal Daemon on hicks.lab.eng.blr.redhat.com        N/A     Y       5613

There are no active volume tasks

root@king [Sep-02-2013-11:43:12] >getfattr -d -e hex -m .
/rhs/bricks/vol_dis_1_rep_2_b1/testfile
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/vol_dis_1_rep_2_b1/testfile
trusted.afr.vol_dis_1_rep_2-client-0=0x000000000000000000000000
trusted.afr.vol_dis_1_rep_2-client-1=0x000000000000000000000000
trusted.gfid=0x4e0603860f71430bb511bc1d7bba9ca4

root@king [Sep-02-2013-11:43:13] >md5sum /rhs/bricks/vol_dis_1_rep_2_b1/testfile
4d1c15c19e60949c3c584a9f051afd1d  /rhs/bricks/vol_dis_1_rep_2_b1/testfile

root@king [Sep-02-2013-11:43:16] >
root@king [Sep-02-2013-11:45:11] >
root@king [Sep-02-2013-11:45:11] >gluster v heal `gluster v list` info
Gathering list of entries to be healed on volume vol_dis_1_rep_2 has been successful

Brick hicks.lab.eng.blr.redhat.com:/rhs/bricks/vol_dis_1_rep_2_b0
Number of entries: 0

Brick king.lab.eng.blr.redhat.com:/rhs/bricks/vol_dis_1_rep_2_b1
Number of entries: 0

root@king [Sep-02-2013-11:45:20] >gluster v heal `gluster v list` info healed
Gathering list of healed entries on volume vol_dis_1_rep_2 has been successful

Brick hicks.lab.eng.blr.redhat.com:/rhs/bricks/vol_dis_1_rep_2_b0
Number of entries: 0

Brick king.lab.eng.blr.redhat.com:/rhs/bricks/vol_dis_1_rep_2_b1
Number of entries: 1
at                    path on brick
-----------------------------------
2013-09-02 11:43:09 <gfid:4e060386-0f71-430b-b511-bc1d7bba9ca4>

root@king [Sep-02-2013-12:18:36] >gluster v status
Status of volume: vol_dis_1_rep_2
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick hicks.lab.eng.blr.redhat.com:/rhs/bricks/vol_dis_
1_rep_2_b0                                              49152   Y       2008
Brick king.lab.eng.blr.redhat.com:/rhs/bricks/vol_dis_1
_rep_2_b1                                               49152   Y       4295
NFS Server on localhost                                 2049    Y       6111
Self-heal Daemon on localhost                           N/A     Y       6113
NFS Server on hicks.lab.eng.blr.redhat.com              2049    Y       5607
Self-heal Daemon on hicks.lab.eng.blr.redhat.com        N/A     Y       5613

There are no active volume tasks
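The trusted.afr changelog values captured in the transcripts above can be decoded mechanically. The helper below is an illustration (the name decode_afr is ours, not a GlusterFS API): each 12-byte value packs three big-endian 32-bit counters, for pending data, metadata, and entry operations respectively.

```python
import struct

def decode_afr(hex_value):
    """Split a trusted.afr.* changelog value into its three big-endian
    32-bit pending counters: (data, metadata, entry)."""
    raw = bytes.fromhex(hex_value[2:] if hex_value.startswith("0x") else hex_value)
    return struct.unpack(">III", raw)

# Values captured on brick0 (hicks) before the brick came back online:
print(decode_afr("0x000000020000000000000000"))  # (2, 0, 0)
print(decode_afr("0x000008530000000000000000"))  # (2131, 0, 0)
```

0x853 is 2131 pending data operations recorded on brick0 against client-1, matching row [ 2 2131 ] of the logged pending matrix; brick0 clearly held the fresher copy before the heal zeroed both changelogs and left brick0 with brick1's md5sum.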
Created attachment 793044 [details]
SOS Reports
Able to recreate this issue a couple of times.

glustershd log:
===============
[2013-09-03 07:31:49.146006] I [afr-self-heal-common.c:2827:afr_log_self_heal_completion_status] 0-vol_dis_1_rep_2-replicate-0: foreground data self heal is successfully completed, from vol_dis_1_rep_2-client-1 with 7 7 sizes - Pending matrix: [ [ 2 145 ] [ 1 1 ] ] on <gfid:4d97ab4e-d9f4-48ad-8e4c-13de46e6cb92>
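The pending matrix glustershd prints is just each brick's changelog data counters laid out row by row; plugging in the trusted.afr values from the first occurrence's transcript reproduces the logged matrix exactly. This is an illustration with a hypothetical helper name (data_pending), not GlusterFS code.

```python
import struct

def data_pending(hex_value):
    # First big-endian 32-bit word of a trusted.afr value = the data counter.
    return struct.unpack(">I", bytes.fromhex(hex_value[2:])[:4])[0]

# Changelog values seen on each brick before the heal in the first occurrence:
brick0 = ["0x000000020000000000000000",  # trusted.afr...-client-0 on brick0
          "0x000008530000000000000000"]  # trusted.afr...-client-1 on brick0
brick1 = ["0x000000010000000000000000",  # trusted.afr...-client-0 on brick1
          "0x000000010000000000000000"]  # trusted.afr...-client-1 on brick1

matrix = [[data_pending(v) for v in brick0],
          [data_pending(v) for v in brick1]]
print(matrix)  # [[2, 2131], [1, 1]] -- the matrix logged on 2013-09-02
```

The second recreation's matrix [ [ 2 145 ] [ 1 1 ] ] has the same shape: brick0's row heavily accuses brick1, yet the heal again ran "from vol_dis_1_rep_2-client-1".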
Posted the patch upstream: http://review.gluster.org/5763
https://code.engineering.redhat.com/gerrit/12448
Verified the fix on the "glusterfs 3.4.0.36rhs" build with the steps mentioned in the description. The bug is fixed. Moving the bug to the verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1769.html