Description of problem:

After rebuilding brick A in a 2-node replicate cluster, self-heal copies data from brick B to brick A, but then updates the timestamp of the file to the time the copy finished on brick A. This timestamp is updated on both brick A and brick B, overwriting the file's actual timestamp.

This only happens when brick A (the brick being rebuilt) is the first brick listed in volume info, but *not* when brick A is the second brick listed.

Version-Release number of selected component (if applicable):

Behavior seen on:
* CentOS 5.9 with ext3 filesystems
  * glusterfs-server-3.3.1-1.el5
  * glusterfs-3.3.1-1.el5
  * glusterfs-fuse-3.3.1-1.el5
* CentOS 6.3 with btrfs filesystems
  * glusterfs-server-3.3.1-1.el6.x86_64
  * glusterfs-3.3.1-1.el6.x86_64
  * glusterfs-fuse-3.3.1-1.el6.x86_64

How reproducible:

Very -- happens every time we try to rebuild the primary brick.

Steps to Reproduce:
1. create a 2-brick replicate cluster
2. copy data into the cluster
3. rebuild the first brick with a new filesystem, but the same node UUID. In our case, we did a full kickstart (to simulate catastrophic hardware failure) and used the procedure at:
   http://gluster.org/community/documentation/index.php/Gluster_3.2:_Brick_Restoration_-_Replace_Crashed_Server
4. initiate self-heal (any way -- stat files, heal full, etc.)

Actual results:

After replicate copies the bits from brick B to brick A, timestamps on both bricks get updated to the time each file was copied to brick A.

Expected results:

Timestamps should be preserved.

Additional info:

Original email to gluster-users is at:
http://www.gluster.org/pipermail/gluster-users/2013-February/035581.html
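(A hedged sketch of step 4, not part of the original report -- volume name and paths are assumptions. Either of the usual triggers reproduces the problem, after which the per-brick copies can be compared directly on the backend.)

# stat every file through the mount to trigger self-heal, or ask for a full heal
find /mnt/gv0 -noleaf -print0 | xargs -0 stat > /dev/null
gluster volume heal gv0 full

# then compare the copies on the two bricks (brick paths are placeholders)
ls -l --full-time /gfs0/brickA/somefile /gfs0/brickB/somefile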
Created attachment 706248 [details]
data.log TRACE output

This is TRACE log output (data.log) captured while issuing a self-heal by running:

US-CA1 host13 (root) /var/log/glusterfs:$ ls -l /data/lib/Rcs.pm
-r--r--r-- 1 root root 42400 Mar  5 17:27 /data/lib/Rcs.pm

Shortly thereafter the timestamp is updated:

US-CA1 host13 (root) /var/log/glusterfs:$ ls -la /gfs0/lib/Rcs.pm
-r--r--r-- 2 root root 42400 Mar  6 20:38 /gfs0/lib/Rcs.pm
US-CA1 host13 (root) /var/log/glusterfs:$ ls -l /data/lib/Rcs.pm
-r--r--r-- 1 root root 42400 Mar  6 20:38 /data/lib/Rcs.pm

If there are other logs that you think would be useful, let me know. I'm sitting in the #gluster IRC channel with the tjstansell nick if folks would like to chat in real time.

Todd
JoeJulian built 3.3.0 for me so I could test that:
http://koji.fedoraproject.org/koji/taskinfo?taskID=5086588

I used the rpms produced at http://kojipkgs.fedoraproject.org/work/tasks/6588/5086588/, uninstalled my 3.3.1 packages, installed the 3.3.0 ones, restored the UUIDs so they didn't change on the servers, and started things back up.

I did a self-heal on another directory and the timestamps there also got updated.

Todd
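(Hedged aside, not from this comment: the usual way to keep a server's identity across a reinstall, per the 3.3-era brick-restoration procedure linked in the description, is to restore the old UUID in glusterd's info file before starting glusterd. File path, format, and the example UUID value are assumptions.)

service glusterd stop
# put the original UUID back so peers still recognize this node
echo "UUID=860e1ecd-78c7-4a99-a642-e8eac0919a56" > /var/lib/glusterd/glusterd.info
service glusterd start
gluster peer status    # peers should show the same UUIDs as before the rebuild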
I have simplified things down to the following test case to replicate this problem:

1. stop all gluster processes on node A

2. nuke both links to a file on the brick:

US-CA1 host13 (root)$ getfattr -d -e hex -m "" /gfs0/brick1/bin/sync-data
getfattr: Removing leading '/' from absolute path names
# file: gfs0/brick1/bin/sync-data
trusted.afr.gv0-client-0=0x000000000000000000000000
trusted.afr.gv0-client-1=0x000000000000000000000000
trusted.gfid=0xabe895ece50241d28502952de49693e1

US-CA1 host13 (root)$ rm -f /gfs0/brick1/bin/sync-data /gfs0/brick1/.glusterfs/ab/e8/abe895ec-e502-41d2-8502-952de49693e1

3. start glusterd back up

4. self heal the file

When I do this from host13 (the second brick in the list) it works as expected. When I do this from host14 (the primary brick), the file is healed, but the timestamp gets updated.

At this point I've verified the same behavior with 3.3.0, 3.3.1, and 3.4.0alpha.

Todd
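(Hedged helper, not part of the test case above: a small shell function that derives the .glusterfs hard-link path from a file's trusted.gfid xattr, matching the layout used in step 2 -- <brick>/.glusterfs/<gfid[0:2]>/<gfid[2:4]>/<gfid-with-dashes>. The function name is mine.)

gfid_path() {
    local brick=$1 file=$2
    local hex dashed
    # trusted.gfid prints as e.g. trusted.gfid=0xabe895ece50241d28502952de49693e1
    hex=$(getfattr -n trusted.gfid -e hex --absolute-names "$file" 2>/dev/null \
          | awk -F= '/trusted.gfid/ {print $2}' | sed 's/^0x//')
    dashed="${hex:0:8}-${hex:8:4}-${hex:12:4}-${hex:16:4}-${hex:20:12}"
    echo "$brick/.glusterfs/${hex:0:2}/${hex:2:2}/$dashed"
}

# e.g. gfid_path /gfs0/brick1 /gfs0/brick1/bin/sync-data
# -> /gfs0/brick1/.glusterfs/ab/e8/abe895ec-e502-41d2-8502-952de49693e1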
Here's the set of commands I ran to reproduce this on a clean install:

1. install glusterfs

host14$ mkdir /gfs0/brick13 /gfs0/brick14
host14$ yum install glusterfs glusterfs-server glusterfs-fuse
host14$ service glusterd start
host13$ mkdir /gfs0/brick13 /gfs0/brick14
host13$ yum install glusterfs glusterfs-server glusterfs-fuse
host13$ service glusterd start

2. build two identical gluster volumes, except for the ordering of bricks:

host14$ gluster peer probe host13
host14$ gluster peer status
host14$ gluster volume create gv13 replica 2 transport tcp host13:/gfs0/brick13 host14:/gfs0/brick13
host14$ gluster volume create gv14 replica 2 transport tcp host14:/gfs0/brick14 host13:/gfs0/brick14
host14$ gluster volume set gv13 nfs.disable on
host14$ gluster volume set gv14 nfs.disable on
host14$ gluster volume start gv13
host14$ gluster volume start gv14

3. mount volumes and copy data into them

/etc/fstab entries:
host13:/gv13 /gv13 glusterfs defaults 0 0
host13:/gv14 /gv14 glusterfs defaults 0 0

host13$ mkdir /gv13 /gv14
host13$ mount /gv13
host13$ mount /gv14
host13$ cp -p /etc/init.d/gluster* /gv13
host13$ cp -p /etc/init.d/gluster* /gv14

4. simulate disk failure

host13$ umount /gv13
host14$ umount /gv14
host13$ service glusterd stop
host13$ service glusterfsd stop
host13$ ps -ef | grep gluster
host13$ pkill gluster
host14$ service glusterd stop
host14$ service glusterfsd stop
host14$ ps -ef | grep gluster
host14$ pkill gluster
host14$ getfattr -d -e hex -m "" /gfs0/brick13/glusterfs-heal
host14$ rm -f /gfs0/brick13/glusterfs-heal /gfs0/brick13/.glusterfs/fc/bc/fcbc9c6b-d87b-4b4d-921d-ea6a8fa25d3f
host14$ getfattr -d -e hex -m "" /gfs0/brick14/glusterfs-heal
host14$ rm -f /gfs0/brick14/glusterfs-heal /gfs0/brick14/.glusterfs/66/0d/660d3734-04d0-438f-9746-a1d719ce6b00
host14$ service glusterd start

5. trigger self-heal and watch timestamps get updated, but only on /gv14:

before:

US-CA1 host13 (root) /:$ ls -la /gv13 /gv14
/gv13:
total 10
drwxr-xr-x   3 root root   98 Mar  7 00:38 .
dr-xr-xr-x. 26 root root 4096 Mar  7 00:37 ..
-rwxr-xr-x   1 root root 1979 Feb  8 13:32 glusterd
-rwxr-xr-x   1 root root  468 Mar  4 23:55 glusterfs-heal
-rwxr-xr-x   1 root root  572 Mar  4 23:55 glusterfs-mount
-rwxr-xr-x   1 root root 2019 Feb  8 13:32 glusterfsd

/gv14:
total 10
drwxr-xr-x   3 root root   98 Mar  7 00:38 .
dr-xr-xr-x. 26 root root 4096 Mar  7 00:37 ..
-rwxr-xr-x   1 root root 1979 Feb  8 13:32 glusterd
-rwxr-xr-x   1 root root  468 Mar  4 23:55 glusterfs-heal
-rwxr-xr-x   1 root root  572 Mar  4 23:55 glusterfs-mount
-rwxr-xr-x   1 root root 2019 Feb  8 13:32 glusterfsd

after:

US-CA1 host13 (root) /:$ ls -la /gv13 /gv14
/gv13:
total 10
drwxr-xr-x   3 root root   98 Mar  7 00:38 .
dr-xr-xr-x. 26 root root 4096 Mar  7 00:37 ..
-rwxr-xr-x   1 root root 1979 Feb  8 13:32 glusterd
-rwxr-xr-x   1 root root  468 Mar  4 23:55 glusterfs-heal
-rwxr-xr-x   1 root root  572 Mar  4 23:55 glusterfs-mount
-rwxr-xr-x   1 root root 2019 Feb  8 13:32 glusterfsd

/gv14:
total 10
drwxr-xr-x   3 root root   98 Mar  7 00:38 .
dr-xr-xr-x. 26 root root 4096 Mar  7 00:37 ..
-rwxr-xr-x   1 root root 1979 Feb  8 13:32 glusterd
-rwxr-xr-x   1 root root  468 Mar  7 00:41 glusterfs-heal
-rwxr-xr-x   1 root root  572 Mar  4 23:55 glusterfs-mount
-rwxr-xr-x   1 root root 2019 Feb  8 13:32 glusterfsd
And then doing the same thing, but from the other node, it affects only /gv13 (since its primary brick is on host13):

host13$ service glusterd stop
host13$ service glusterfsd stop
host13$ pkill gluster
host13$ rm -f /gfs0/brick13/glusterfs-heal /gfs0/brick13/.glusterfs/fc/bc/fcbc9c6b-d87b-4b4d-921d-ea6a8fa25d3f
host13$ rm -f /gfs0/brick14/glusterfs-heal /gfs0/brick14/.glusterfs/66/0d/660d3734-04d0-438f-9746-a1d719ce6b00
host13$ service glusterd start

see only /gv13 affected:

US-CA1 host13 (root) /:$ date; ls -l /gv*/glusterfs-heal
Thu Mar  7 01:18:27 UTC 2013
-rwxr-xr-x 1 root root 468 Mar  4 23:55 /gv13/glusterfs-heal
-rwxr-xr-x 1 root root 468 Mar  7 00:41 /gv14/glusterfs-heal
US-CA1 host13 (root) /:$ date; ls -l /gv*/glusterfs-heal
Thu Mar  7 01:18:33 UTC 2013
-rwxr-xr-x 1 root root 468 Mar  7 01:18 /gv13/glusterfs-heal
-rwxr-xr-x 1 root root 468 Mar  7 00:41 /gv14/glusterfs-heal
I've done some more testing using a 3-brick replicate cluster locally on one host. It will always use the timestamp from brick1 when repairing the other bricks.

In the case of removing a file from brick2 and touching brick3 (so it was different than brick1), all files get healed to brick1's timestamp. When removing the file from brick1, it always updates all 3 bricks to the timestamp of when it restored the data to brick1.

It looks to me like the metadata heal always happens after the data heal, and the metadata heal *always* uses the first brick as the authoritative source for metadata.

Interestingly, if I chown the file in brick1 and touch the file in brick2 and brick3 so they all have separate metadata, it seems to randomly choose a brick to use as the authoritative source, which is probably fine. It's a split-brain scenario for metadata, so it just chooses one.

before:

US-CA1 host13 (root) /:$ ls -la --full-time /gfs0/brick?/g*
-rwxr-xr-x 2 bstansell root 468 2013-03-07 15:33:10.225361000 +0000 /gfs0/brick2/glusterfs-heal
-rwxr-xr-x 2 dstuit    root 468 2013-03-07 15:33:10.225361533 +0000 /gfs0/brick3/glusterfs-heal

after:

US-CA1 host13 (root) /:$ ls -la --full-time /gfs0/brick?/g*
-rwxr-xr-x 2 bstansell root 468 2013-03-07 15:41:15.611249000 +0000 /gfs0/brick1/glusterfs-heal
-rwxr-xr-x 2 bstansell root 468 2013-03-07 15:41:15.611249000 +0000 /gfs0/brick2/glusterfs-heal
-rwxr-xr-x 2 bstansell root 468 2013-03-07 15:41:15.611249000 +0000 /gfs0/brick3/glusterfs-heal

Any time it's restoring the file to brick1, it updates the timestamp.

Here's a fun one ... simulating brick1 going offline briefly.

1. kill the glusterfsd process for brick1

2. update the file in /gv3

US-CA1 host13 (root) /:$ cp -p /etc/init.d/glusterfsd /gv3/glusterfs-heal

3. see the discrepancy:

US-CA1 host13 (root) /:$ ls -la --full-time /gfs0/brick?/g*
-rwxr-xr-x 2 root root  468 2013-03-07 15:49:19.845462000 +0000 /gfs0/brick1/glusterfs-heal
-rwxr-xr-x 2 root root 2019 2013-02-08 13:32:41.000000000 +0000 /gfs0/brick2/glusterfs-heal
-rwxr-xr-x 2 root root 2019 2013-02-08 13:32:41.000000000 +0000 /gfs0/brick3/glusterfs-heal

4. restart glusterd to bring brick1 back online

US-CA1 host13 (root) /:$ service glusterd stop
Stopping glusterd:                                         [  OK  ]
US-CA1 host13 (root) /:$ service glusterfsd stop
Stopping glusterfsd:                                       [  OK  ]
US-CA1 host13 (root) /:$ service glusterd start
Starting glusterd:                                         [  OK  ]

5. check out self-heal results:

US-CA1 host13 (root) /:$ ls -la --full-time /gfs0/brick?/g*
-rwxr-xr-x 2 root root 2019 2013-03-07 15:53:39.102271000 +0000 /gfs0/brick1/glusterfs-heal
-rwxr-xr-x 2 root root 2019 2013-03-07 15:53:39.102271000 +0000 /gfs0/brick2/glusterfs-heal
-rwxr-xr-x 2 root root 2019 2013-03-07 15:53:39.102271000 +0000 /gfs0/brick3/glusterfs-heal

It used the newly updated data, but updated all of the timestamps.

Todd
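(Hedged aside -- the interpretation here is an assumption, not stated in this report: the trusted.afr.<vol>-client-N xattrs shown earlier are AFR's per-file changelog counters, and a non-zero value means the corresponding brick is "accused" of having pending operations. Dumping them on each brick before triggering a heal shows which copy AFR is likely to treat as the source.)

for b in /gfs0/brick1 /gfs0/brick2 /gfs0/brick3; do
    echo "== $b =="
    getfattr -d -e hex -m trusted.afr "$b/glusterfs-heal" 2>/dev/null
done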
I'm marking this as urgent, since it involves corruption of metadata.

Todd
I've been digging through trace logs and found the following. Not sure if this is the issue or not.

The logs show metadata+data self-healing twice:

[2013-03-07 17:00:46.972953] D [afr-common.c:1211:afr_lookup_set_self_heal_params_by_xattr] 0-gv3-replicate-0: metadata self-heal is pending for <gfid:78a027b5-dae3-4851-904c-926034438fb4>.
[2013-03-07 17:00:46.972958] D [afr-common.c:1225:afr_lookup_set_self_heal_params_by_xattr] 0-gv3-replicate-0: data self-heal is pending for <gfid:78a027b5-dae3-4851-904c-926034438fb4>.
[2013-03-07 17:00:46.972963] D [afr-common.c:1211:afr_lookup_set_self_heal_params_by_xattr] 0-gv3-replicate-0: metadata self-heal is pending for <gfid:78a027b5-dae3-4851-904c-926034438fb4>.
[2013-03-07 17:00:46.972969] D [afr-common.c:1225:afr_lookup_set_self_heal_params_by_xattr] 0-gv3-replicate-0: data self-heal is pending for <gfid:78a027b5-dae3-4851-904c-926034438fb4>.

which is fine ... it then proceeds to do the first metadata check ...

[2013-03-07 17:00:46.973038] T [afr-self-heal-common.c:2357:afr_self_heal] 0-gv3-replicate-0: proceeding to metadata check on <gfid:78a027b5-dae3-4851-904c-926034438fb4>
[2013-03-07 17:00:46.973070] T [afr-self-heal-common.c:1027:afr_sh_missing_entries_done] 0-gv3-replicate-0: proceeding to metadata check on <gfid:78a027b5-dae3-4851-904c-926034438fb4>

but fails to attain non-blocking locks ...

[2013-03-07 17:00:46.973924] T [afr-lk-common.c:1472:afr_nonblocking_inodelk_cbk] 0-gv3-replicate-0: Last inode locking reply received
[2013-03-07 17:00:46.973930] T [afr-lk-common.c:1486:afr_nonblocking_inodelk_cbk] 0-gv3-replicate-0: 0 servers locked. Trying again with blocking calls
[2013-03-07 17:00:46.973937] D [afr-lk-common.c:456:transaction_lk_op] 0-gv3-replicate-0: lk op is for a self heal
[2013-03-07 17:00:46.973944] T [afr-lk-common.c:716:afr_unlock_inodelk] 0-gv3-replicate-0: No internal locks unlocked
[2013-03-07 17:00:46.973950] D [afr-self-heal-metadata.c:478:afr_sh_metadata_post_nonblocking_inodelk_cbk] 0-gv3-replicate-0: Non Blocking metadata inodelks failed for <gfid:78a027b5-dae3-4851-904c-926034438fb4>.
[2013-03-07 17:00:46.973956] D [afr-self-heal-metadata.c:480:afr_sh_metadata_post_nonblocking_inodelk_cbk] 0-gv3-replicate-0: Metadata self-heal failed for <gfid:78a027b5-dae3-4851-904c-926034438fb4>.

but instead of retrying with blocking locks, like it says, it simply moves on to the data heal:

[2013-03-07 17:00:46.973961] D [afr-self-heal-metadata.c:61:afr_sh_metadata_done] 0-gv3-replicate-0: proceeding to data check on <gfid:78a027b5-dae3-4851-904c-926034438fb4>

though later on it does try to sync metadata again:

[2013-03-07 17:00:46.974540] T [afr-self-heal-metadata.c:355:afr_sh_metadata_sync_prepare] 0-gv3-replicate-0: syncing metadata of <gfid:78a027b5-dae3-4851-904c-926034438fb4> from subvolume gv3-client-1 to 1 active sinks
...
[2013-03-07 17:00:46.975364] D [afr-self-heal-metadata.c:267:afr_sh_metadata_sync] 0-gv3-replicate-0: self-healing metadata of <gfid:78a027b5-dae3-4851-904c-926034438fb4> from gv3-client-1 to gv3-client-0

but then see the same failure to get non-blocking locks:

[2013-03-07 17:00:46.975849] D [afr-self-heal-data.c:1134:afr_sh_data_post_nonblocking_inodelk_cbk] 0-gv3-replicate-0: Non Blocking data inodelks failed for <gfid:78a027b5-dae3-4851-904c-926034438fb4> by bc60dc7ce87f0000

Interestingly, the pending_matrix shows -2 for the 2 bricks where the data was being synced from (not that I understand the meaning of this):

[2013-03-07 17:00:46.976090] D [afr-self-heal-common.c:2510:afr_sh_erase_pending] 0-gv3-replicate-0: Delta matrix for: 7089dc7ce87f0000
[2013-03-07 17:00:46.976107] D [afr-self-heal-common.c:138:afr_sh_print_pending_matrix] 0-gv3-replicate-0: pending_matrix: [ 0 0 0 ]
[2013-03-07 17:00:46.976121] D [afr-self-heal-common.c:138:afr_sh_print_pending_matrix] 0-gv3-replicate-0: pending_matrix: [ -2 0 0 ]
[2013-03-07 17:00:46.976134] D [afr-self-heal-common.c:138:afr_sh_print_pending_matrix] 0-gv3-replicate-0: pending_matrix: [ -2 0 0 ]

and then proceeds to data check ...

[2013-03-07 17:00:46.977685] D [afr-self-heal-metadata.c:61:afr_sh_metadata_done] 0-gv3-replicate-0: proceeding to data check on <gfid:78a027b5-dae3-4851-904c-926034438fb4>

I'll attach the full log in a bit.

Todd
Created attachment 706770 [details]
self-heal trace log

Here's the self-heal log with trace enabled that I referenced in the previous comment.

Todd
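(Hedged note on capturing traces like the attached one -- the option names are an assumption, not given in the report: raise the client and brick log levels to TRACE before triggering the heal, then lower them again once the logs are collected.)

gluster volume set gv3 diagnostics.client-log-level TRACE
gluster volume set gv3 diagnostics.brick-log-level TRACE
# ... trigger the self-heal, then collect /var/log/glusterfs/*.log and bricks/*.log ...
gluster volume set gv3 diagnostics.client-log-level INFO
gluster volume set gv3 diagnostics.brick-log-level INFO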
Todd,
I tried to re-create the issue locally on my machine, but I am not able to :-(. Here are the two things I tried using the test framework we have. Please let me know what I am missing.

In the scripts below:
  $CLI is gluster --mode=script
  $V0 is the name of the volume
  $H0 is the ip/hostname of the machine
  $M0 is the mount point
  $B0 is the directory in which the bricks reside
  TEST fails if the command fails
  EXPECT <expected-value> <command-which-is-supposed-to-give-value>

Test 1:

#!/bin/bash
. $(dirname $0)/../include.rc
. $(dirname $0)/../volume.rc

cleanup;

TEST glusterd
TEST pidof glusterd
TEST $CLI volume create $V0 replica 2 $H0:$B0/${V0}0 $H0:$B0/${V0}1
TEST $CLI volume set $V0 cluster.self-heal-daemon off
TEST $CLI volume start $V0
TEST glusterfs --volfile-id=/$V0 --volfile-server=$H0 $M0 --attribute-timeout=0 --entry-timeout=0

#stat gives modify time / access time. We get the time stamp based on type.
function get_timestamp {
        local f=$1
        local type=$2
        stat $f | grep $type | grep -v Uid | awk '{print $3}' | cut -f1 -d'.'
}

#kill brick1
TEST kill_brick $V0 $H0 $B0/${V0}1
#file is created only in brick0; healing will happen from brick0 to brick1
echo "abc" > $M0/a
TEST $CLI volume start $V0 force
#wait till brick1 comes up
EXPECT_WITHIN 30 "1" afr_child_up_status $V0 1

#kill brick0
TEST kill_brick $V0 $H0 $B0/${V0}0
#file is created only in brick1; healing will happen from brick1 to brick0
echo "abc" > $M0/b
TEST $CLI volume start $V0 force
#wait till brick0 comes up
EXPECT_WITHIN 30 "1" afr_child_up_status $V0 0

#get time stamps before self-heal
TEST access_tstampa_0=$(get_timestamp $B0/${V0}0/a Access)
TEST modify_tstampa_0=$(get_timestamp $B0/${V0}0/a Modify)
TEST access_tstampb_1=$(get_timestamp $B0/${V0}1/b Access)
TEST modify_tstampb_1=$(get_timestamp $B0/${V0}1/b Modify)

#wait for 2 seconds to see if after healing the time stamps differ or not
sleep 2
#trigger self-heals
find $M0 | xargs stat 1>/dev/null

#get time stamps after self-heal on the bricks they are healed to
TEST access_tstampa_1=$(get_timestamp $B0/${V0}1/a Access)
TEST modify_tstampa_1=$(get_timestamp $B0/${V0}1/a Modify)
TEST access_tstampb_0=$(get_timestamp $B0/${V0}0/b Access)
TEST modify_tstampb_0=$(get_timestamp $B0/${V0}0/b Modify)

#check if the timestamps match or not.
EXPECT $access_tstampa_0 echo $access_tstampa_1
EXPECT $modify_tstampa_0 echo $modify_tstampa_1
EXPECT $access_tstampb_0 echo $access_tstampb_1
EXPECT $modify_tstampb_0 echo $modify_tstampb_1

cleanup;

Test 2:

#!/bin/bash
. $(dirname $0)/../include.rc
. $(dirname $0)/../volume.rc

cleanup;

#Simulates a drive replacement by deleting a file and its gfid-link and then
#checks if the timestamps match after self-heal.

TEST glusterd
TEST pidof glusterd
TEST $CLI volume create $V0 replica 2 $H0:$B0/${V0}0 $H0:$B0/${V0}1
TEST $CLI volume set $V0 cluster.self-heal-daemon off
TEST $CLI volume start $V0
TEST glusterfs --volfile-id=/$V0 --volfile-server=$H0 $M0 --attribute-timeout=0 --entry-timeout=0

function get_timestamp {
        local f=$1
        local type=$2
        stat $f | grep $type | grep -v Uid | awk '{print $3}' | cut -f1 -d'.'
}

echo "abc" > $M0/a
echo "abc" > $M0/b

TEST kill_brick $V0 $H0 $B0/${V0}0
#deletes both the file and its gfid link from .glusterfs
TEST gf_rm_file_and_gfid_link $B0/${V0}0/a
TEST gf_rm_file_and_gfid_link $B0/${V0}0/b

TEST access_tstampa_1=$(get_timestamp $B0/${V0}1/a Access)
TEST modify_tstampa_1=$(get_timestamp $B0/${V0}1/a Modify)
TEST access_tstampb_1=$(get_timestamp $B0/${V0}1/b Access)
TEST modify_tstampb_1=$(get_timestamp $B0/${V0}1/b Modify)

TEST $CLI volume start $V0 force
EXPECT_WITHIN 30 "1" afr_child_up_status $V0 0

sleep 2
find $M0 | xargs stat 1>/dev/null

TEST access_tstampa_0=$(get_timestamp $B0/${V0}0/a Access)
TEST modify_tstampa_0=$(get_timestamp $B0/${V0}0/a Modify)
TEST access_tstampb_0=$(get_timestamp $B0/${V0}0/b Access)
TEST modify_tstampb_0=$(get_timestamp $B0/${V0}0/b Modify)

EXPECT $access_tstampa_0 echo $access_tstampa_1
EXPECT $modify_tstampa_0 echo $modify_tstampa_1
EXPECT $access_tstampb_0 echo $access_tstampb_1
EXPECT $modify_tstampb_0 echo $modify_tstampb_1

cleanup;
Here's the 2-brick test case:

US-CA1 host13 (root) /var/log/glusterfs:$ mkdir /gfs0/brick0{1,2}
US-CA1 host13 (root) /var/log/glusterfs:$ gluster volume create gv01 replica 2 transport tcp host13:/gfs0/brick01 host13:/gfs0/brick02
Multiple bricks of a replicate volume are present on the same server. This setup is not optimal.
Do you still want to continue creating the volume?  (y/n) y
volume create: gv01: success: please start the volume to access data
US-CA1 host13 (root) /var/log/glusterfs:$ gluster volume set gv01 nfs.disable on
volume set: success
US-CA1 host13 (root) /var/log/glusterfs:$ gluster volume start gv01
volume start: gv01: success
US-CA1 host13 (root) /var/log/glusterfs:$ mount -t glusterfs host13:/gv01 /gv01
US-CA1 host13 (root) /var/log/glusterfs:$ cp -p /etc/init.d/glusterd /gv01
US-CA1 host13 (root) /var/log/glusterfs:$ ls -la --full-time /gfs0/brick0?/g*
-rwxr-xr-x 2 root root 1979 2013-02-08 13:32:41.000000000 +0000 /gfs0/brick01/glusterd
-rwxr-xr-x 2 root root 1979 2013-02-08 13:32:41.000000000 +0000 /gfs0/brick02/glusterd
US-CA1 host13 (root) /var/log/glusterfs:$ ps -ef | grep brick01
root      4926     1  0 04:48 ?        00:00:00 /usr/sbin/glusterfsd -s localhost --volfile-id gv01.host13.gfs0-brick01 -p /var/lib/glusterd/vols/gv01/run/host13-gfs0-brick01.pid -S /var/run/a7c26a6374c851d7baf4c925dfd87d0f.socket --brick-name /gfs0/brick01 -l /var/log/glusterfs/bricks/gfs0-brick01.log --xlator-option *-posix.glusterd-uuid=860e1ecd-78c7-4a99-a642-e8eac0919a56 --brick-port 49159 --xlator-option gv01-server.listen-port=49159
root      5012  7522  0 04:48 pts/0    00:00:00 grep brick01
US-CA1 host13 (root) /var/log/glusterfs:$ kill 4926
US-CA1 host13 (root) /var/log/glusterfs:$ rm -f /gfs0/brick01/glusterd /gfs0/brick01/.glusterfs/
00/        b5/        indices/   landfill/
US-CA1 host13 (root) /var/log/glusterfs:$ rm -f /gfs0/brick01/glusterd /gfs0/brick01/.glusterfs/b5/5b/b55b1a9e-0ebd-4d8a-8837-3f5d1c9977be
US-CA1 host13 (root) /var/log/glusterfs:$ ls -la --full-time /gfs0/brick0?/g*
-rwxr-xr-x 2 root root 1979 2013-02-08 13:32:41.000000000 +0000 /gfs0/brick02/glusterd
US-CA1 host13 (root) /var/log/glusterfs:$ gluster volume start gv01 force
volume start: gv01: success
US-CA1 host13 (root) /var/log/glusterfs:$ ls -la --full-time /gfs0/brick0?/g*
-rwxr-xr-x 2 root root 1979 2013-02-08 13:32:41.000000000 +0000 /gfs0/brick02/glusterd
US-CA1 host13 (root) /var/log/glusterfs:$ gluster volume heal gv01 full
Launching Heal operation on volume gv01 has been successful
Use heal info commands to check status
US-CA1 host13 (root) /var/log/glusterfs:$ ls -la --full-time /gfs0/brick0?/g*
-rwxr-xr-x 2 root root 1979 2013-03-08 04:49:49.645296000 +0000 /gfs0/brick01/glusterd
-rwxr-xr-x 2 root root 1979 2013-03-08 04:49:49.645296000 +0000 /gfs0/brick02/glusterd

Todd
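(Hedged follow-up, not part of the transcript above: the heal-info command mentioned in the CLI output, plus a direct mtime comparison on the backend, makes the regression check scriptable.)

gluster volume heal gv01 info
# epoch mtimes of the two copies; they should still match the pre-failure value
stat -c '%n %Y' /gfs0/brick01/glusterd /gfs0/brick02/glusterd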
http://review.gluster.org/4658
CHANGE: http://review.gluster.org/4658 (cluster/afr: Preserve mtime in self-heal) merged in master by Vijay Bellur (vbellur)
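(Hedged illustration of what the merged change addresses, not the actual patch: conceptually, once data has been copied to the sink brick, the healed copy's mtime should be set back to the source brick's mtime instead of being left at the time of the copy -- roughly the userspace equivalent of the following, with brick paths assumed.)

src=/gfs0/brick02/glusterd   # healthy copy (heal source)
dst=/gfs0/brick01/glusterd   # rebuilt copy (heal sink)
touch -r "$src" "$dst"                 # carry the source's atime/mtime over to the sink
stat -c '%n %y' "$src" "$dst"          # both copies now report the original mtime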
Will this fix be backported to 3.3.1, or will it only be available in some future 3.4 build? I'm looking for the most stable way to be able to use this soon on CentOS 6.

Todd
Given that this is a fix for data corruption, I'm officially asking for this to get backported to at least the last stable release, 3.3.1. I'm not sure how to officially request this, but it doesn't make sense to force people to upgrade to 3.4 to get the fix. Todd
+1. Please backport bugfixes to 3.3.
(In reply to comment #15)
> Given that this is a fix for data corruption, I'm officially asking for this
> to get backported to at least the last stable release, 3.3.1. I'm not sure
> how to officially request this, but it doesn't make sense to force people to
> upgrade to 3.4 to get the fix.
>
> Todd

Right now requesting backports is ad hoc, but we'll document a process for it. Keep an eye on gluster-users.

http://review.gluster.com/4663
http://review.gluster.com/4664

Those are the backport patches for 3.4 and 3.3.

Pranith
Thanks for the backports. Any idea how long it might take to get reviews for these? We're trying to use 3.3.1, but without this patch we can't feel confident, and this piece of our project is on hold. So I'm just curious how long I should expect the backport review to take so I can prepare.

I know I could build it myself, but I'd much rather have a package that's built using kkeithley's standard build options/environment. Therefore, I'm waiting for this to go in.

Thanks for the great turnaround on this bugfix, by the way!

Todd
COMMIT: http://review.gluster.org/4664 committed in release-3.3 by Vijay Bellur (vbellur)
------
commit a216f5f44675bfe189c318171dbc50e1c19bfc26
Author: Anand Avati <avati>
Date:   Tue Apr 9 17:22:01 2013 -0700

    tests: fix further issues with bug-874498.t

    The failure of bug-874498.t seems to be a "bug" in glustershd. The situation
    seems to be when both subvolumes of a replica are "local" to glustershd, and
    in such cases glustershd is sensitive to the order in which the subvols come
    up.

    The core of the issue itself is that, without the patch (#4784), self-heal
    daemon completes the processing of index and no entries are left inside the
    xattrop index after a few seconds of volume start force. However with the
    patch, the stale "backing file" (against which index performs link()) is
    left. The likely reason is that an "INDEX" based crawl is not happening
    against the subvol when this patch is applied.

    Before #4784 patch, the order in which subvols came up was:

      [2013-04-09 22:55:35.117679] I [client-handshake.c:1456:client_setvolume_cbk] 0-patchy-client-0: Connected to 10.3.129.13:49156, attached to remote volume '/d/backends/brick1'.
      ...
      [2013-04-09 22:55:35.118399] I [client-handshake.c:1456:client_setvolume_cbk] 0-patchy-client-1: Connected to 10.3.129.13:49157, attached to remote volume '/d/backends/brick2'.

    However, with the patch, the order is reversed:

      [2013-04-09 22:53:34.945370] I [client-handshake.c:1456:client_setvolume_cbk] 0-patchy-client-1: Connected to 10.3.129.13:49153, attached to remote volume '/d/backends/brick2'.
      ...
      [2013-04-09 22:53:34.950966] I [client-handshake.c:1456:client_setvolume_cbk] 0-patchy-client-0: Connected to 10.3.129.13:49152, attached to remote volume '/d/backends/brick1'.

    The index in brick2 has the list of files/gfid to heal. It appears to be the
    case that when brick1 is the first subvol to be detected as coming up,
    somehow an INDEX based crawl is clearing all the index entries in brick2,
    but if brick2 comes up as the first subvol, then the backing file is left
    stale.

    Also, doing a "gluster volume heal full" seems to leave out stale backing
    files too. As the crawl is performed on the namespace and the backing file
    is never encountered there to get cleared out.

    So the interim (possibly permanent) fix is to have the script issue a
    regular self-heal command (and not a "full" one).

    The failure of the script itself is non-critical. The data files are all
    healed, and it is just the backing file which is left behind. The stale
    backing file too gets cleared in the next index based healing, either
    triggered manually or after 10mins.

    Change-Id: I5deb79652ef449b7e88684311e804a8a2aa4725d
    BUG: 874498
    Signed-off-by: Anand Avati <avati>
    Reviewed-on: http://review.gluster.org/4798
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Jeff Darcy <jdarcy>
COMMIT: http://review.gluster.org/4663 committed in release-3.4 by Anand Avati (avati)
------
commit 77a02c4dd8467e2a78b3ab3cdef95178ef4b1898
Author: Kaleb S. KEITHLEY <kkeithle>
Date:   Thu Apr 11 09:36:35 2013 -0400

    object-storage: rebase Swift to 1.8.0 (grizzly)

    Two minor tweaks found while packaging 3.4.0-0.1.alpha2 for Fedora 19

    BUG: 948039
    Change-Id: I97175636164702cf4042bc4a18ffead76ad386cb
    Signed-off-by: Kaleb S. KEITHLEY <kkeithle>
    Reviewed-on: http://review.gluster.org/4807
    Reviewed-by: Jeff Darcy <jdarcy>
    Tested-by: Gluster Build System <jenkins.com>