Bug 918437
| Summary: | timestamps updated after self-heal following primary brick rebuild | | |
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Todd Stansell <todd+rhbugs> |
| Component: | replicate | Assignee: | Pranith Kumar K <pkarampu> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.3.1 | CC: | gluster-bugs, jdarcy, joe |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | glusterfs-3.4.0 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 955469 (view as bug list) | Environment: | |
| Last Closed: | 2013-07-24 17:50:48 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 918917, 955469 | | |
| Attachments: | data.log TRACE output; self-heal trace log | | |
Description
Todd Stansell
2013-03-06 08:59:12 UTC
Created attachment 706248 [details]
data.log TRACE output
This is TRACE log output (data.log) while issuing a self-heal by running:
US-CA1 host13 (root) /var/log/glusterfs:$ ls -l /data/lib/Rcs.pm
-r--r--r-- 1 root root 42400 Mar 5 17:27 /data/lib/Rcs.pm
shortly thereafter the timestamp is updated:
US-CA1 host13 (root) /var/log/glusterfs:$ ls -la /gfs0/lib/Rcs.pm
-r--r--r-- 2 root root 42400 Mar 6 20:38 /gfs0/lib/Rcs.pm
US-CA1 host13 (root) /var/log/glusterfs:$ ls -l /data/lib/Rcs.pm
-r--r--r-- 1 root root 42400 Mar 6 20:38 /data/lib/Rcs.pm
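A quick way to watch for this while reproducing (illustrative only; the paths follow the listings above, with /data as the fuse mount and /gfs0 as the brick):

stat -c '%n mtime=%y' /data/lib/Rcs.pm /gfs0/lib/Rcs.pm

Run it before and after triggering the heal; a correct heal should leave the mtime untouched on both paths.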
If there are other logs that you think would be useful, let me know.
I'm sitting in the #gluster IRC channel with the nick tjstansell if folks would like to chat in real time.
Todd
JoeJulian built 3.3.0 for me so I could test that: http://koji.fedoraproject.org/koji/taskinfo?taskID=5086588
I used the rpms produced at http://kojipkgs.fedoraproject.org/work/tasks/6588/5086588/ and uninstalled my 3.3.1 packages, installed the 3.3.0 ones, updated the UUID so they didn't change on the servers, and started things back up. I did a self-heal on another directory and the timestamps there also got updated.
Todd

I have simplified things down to the following test case to replicate this problem:
1. stop all gluster processes on node A
2. nuke both links to a file on the brick:
US-CA1 host13 (root)$ getfattr -d -e hex -m "" /gfs0/brick1/bin/sync-data
getfattr: Removing leading '/' from absolute path names
# file: gfs0/brick1/bin/sync-data
trusted.afr.gv0-client-0=0x000000000000000000000000
trusted.afr.gv0-client-1=0x000000000000000000000000
trusted.gfid=0xabe895ece50241d28502952de49693e1
US-CA1 host13 (root)$ rm -f /gfs0/brick1/bin/sync-data /gfs0/brick1/.glusterfs/ab/e8/abe895ec-e502-41d2-8502-952de49693e1
3. start glusterd back up
4. self-heal the file
When I do this from host13 (the second brick in the list) it works as expected. When I do this from host14 (the primary brick), the file is healed, but the timestamp gets updated. At this point I've verified the same behavior with 3.3.0, 3.3.1, and 3.4.0alpha.
Todd

Here's the set of commands I ran to reproduce this on a clean install:
1. install glusterfs
host14$ mkdir /gfs0/brick13 /gfs0/brick14
host14$ yum install glusterfs glusterfs-server glusterfs-fuse
host14$ service glusterd start
host13$ mkdir /gfs0/brick13 /gfs0/brick14
host13$ yum install glusterfs glusterfs-server glusterfs-fuse
host13$ service glusterd start
2. build two identical gluster volumes, except for the ordering of bricks:
host14$ gluster peer probe host13
host14$ gluster peer status
host14$ gluster volume create gv13 replica 2 transport tcp host13:/gfs0/brick13 host14:/gfs0/brick13
host14$ gluster volume create gv14 replica 2 transport tcp host14:/gfs0/brick14 host13:/gfs0/brick14
host14$ gluster volume set gv13 nfs.disable on
host14$ gluster volume set gv14 nfs.disable on
host14$ gluster volume start gv13
host14$ gluster volume start gv14
3. mount the volumes (via /etc/fstab entries) and copy data into them
host13:/gv13 /gv13 glusterfs defaults 0 0
host13:/gv14 /gv14 glusterfs defaults 0 0
host13$ mkdir /gv13 /gv14
host13$ mount /gv13
host13$ mount /gv14
host13$ cp -p /etc/init.d/gluster* /gv13
host13$ cp -p /etc/init.d/gluster* /gv14
4. simulate disk failure
host13$ umount /gv13
host14$ umount /gv14
host13$ service glusterd stop
host13$ service glusterfsd stop
host13$ ps -ef | grep gluster
host13$ pkill gluster
host14$ service glusterd stop
host14$ service glusterfsd stop
host14$ ps -ef | grep gluster
host14$ pkill gluster
host14$ getfattr -d -e hex -m "" /gfs0/brick13/glusterfs-heal
host14$ rm -f /gfs0/brick13/glusterfs-heal /gfs0/brick13/.glusterfs/fc/bc/fcbc9c6b-d87b-4b4d-921d-ea6a8fa25d3f
host14$ getfattr -d -e hex -m "" /gfs0/brick14/glusterfs-heal
host14$ rm -f /gfs0/brick14/glusterfs-heal /gfs0/brick14/.glusterfs/66/0d/660d3734-04d0-438f-9746-a1d719ce6b00
host14$ service glusterd start
5. trigger self-heal and watch timestamps get updated, but only on /gv14:
US-CA1 host13 (root) /:$ ls -la /gv13 /gv14
/gv13:
total 10
drwxr-xr-x 3 root root 98 Mar 7 00:38 .
dr-xr-xr-x. 26 root root 4096 Mar 7 00:37 ..
-rwxr-xr-x 1 root root 1979 Feb 8 13:32 glusterd
-rwxr-xr-x 1 root root 468 Mar 4 23:55 glusterfs-heal
-rwxr-xr-x 1 root root 572 Mar 4 23:55 glusterfs-mount
-rwxr-xr-x 1 root root 2019 Feb 8 13:32 glusterfsd
/gv14:
total 10
drwxr-xr-x 3 root root 98 Mar 7 00:38 .
dr-xr-xr-x. 26 root root 4096 Mar 7 00:37 ..
-rwxr-xr-x 1 root root 1979 Feb 8 13:32 glusterd
-rwxr-xr-x 1 root root 468 Mar 4 23:55 glusterfs-heal
-rwxr-xr-x 1 root root 572 Mar 4 23:55 glusterfs-mount
-rwxr-xr-x 1 root root 2019 Feb 8 13:32 glusterfsd
US-CA1 host13 (root) /:$ ls -la /gv13 /gv14
/gv13:
total 10
drwxr-xr-x 3 root root 98 Mar 7 00:38 .
dr-xr-xr-x. 26 root root 4096 Mar 7 00:37 ..
-rwxr-xr-x 1 root root 1979 Feb 8 13:32 glusterd
-rwxr-xr-x 1 root root 468 Mar 4 23:55 glusterfs-heal
-rwxr-xr-x 1 root root 572 Mar 4 23:55 glusterfs-mount
-rwxr-xr-x 1 root root 2019 Feb 8 13:32 glusterfsd
/gv14:
total 10
drwxr-xr-x 3 root root 98 Mar 7 00:38 .
dr-xr-xr-x. 26 root root 4096 Mar 7 00:37 ..
-rwxr-xr-x 1 root root 1979 Feb 8 13:32 glusterd
-rwxr-xr-x 1 root root 468 Mar 7 00:41 glusterfs-heal
-rwxr-xr-x 1 root root 572 Mar 4 23:55 glusterfs-mount
-rwxr-xr-x 1 root root 2019 Feb 8 13:32 glusterfsd

And then doing the same thing, but from the other node, it affects only /gv13 (since its primary brick is on host13):
host13$ service glusterd stop
host13$ service glusterfsd stop
host13$ pkill gluster
host13$ rm -f /gfs0/brick13/glusterfs-heal /gfs0/brick13/.glusterfs/fc/bc/fcbc9c6b-d87b-4b4d-921d-ea6a8fa25d3f
host13$ rm -f /gfs0/brick14/glusterfs-heal /gfs0/brick14/.glusterfs/66/0d/660d3734-04d0-438f-9746-a1d719ce6b00
host13$ service glusterd start
see only /gv13 affected:
US-CA1 host13 (root) /:$ date; ls -l /gv*/glusterfs-heal
Thu Mar 7 01:18:27 UTC 2013
-rwxr-xr-x 1 root root 468 Mar 4 23:55 /gv13/glusterfs-heal
-rwxr-xr-x 1 root root 468 Mar 7 00:41 /gv14/glusterfs-heal
US-CA1 host13 (root) /:$ date; ls -l /gv*/glusterfs-heal
Thu Mar 7 01:18:33 UTC 2013
-rwxr-xr-x 1 root root 468 Mar 7 01:18 /gv13/glusterfs-heal
-rwxr-xr-x 1 root root 468 Mar 7 00:41 /gv14/glusterfs-heal

I've done some more testing using a 3-brick replicate cluster locally on one host. It will always use the timestamp from brick1 when repairing the other bricks. In the case of removing a file from brick2 and touching it on brick3 (so it was different from brick1), all files will get healed to brick1's timestamp. When removing the file from brick1, it will always update all 3 bricks to the timestamp of when it restored the data to brick1.
It looks to me like the metadata heal always happens after the data heal, and the metadata heal *always* uses the first brick as the authoritative source for metadata.
Interestingly, if I chown the file in brick1 and touch the file in brick2 and brick3 so they all have separate metadata, it seems to randomly choose a brick to use as the authoritative source, which is probably fine. It's a split-brain scenario for metadata, so it just chooses one.
before:
US-CA1 host13 (root) /:$ ls -la --full-time /gfs0/brick?/g*
-rwxr-xr-x 2 bstansell root 468 2013-03-07 15:33:10.225361000 +0000 /gfs0/brick2/glusterfs-heal
-rwxr-xr-x 2 dstuit root 468 2013-03-07 15:33:10.225361533 +0000 /gfs0/brick3/glusterfs-heal
after:
US-CA1 host13 (root) /:$ ls -la --full-time /gfs0/brick?/g*
-rwxr-xr-x 2 bstansell root 468 2013-03-07 15:41:15.611249000 +0000 /gfs0/brick1/glusterfs-heal
-rwxr-xr-x 2 bstansell root 468 2013-03-07 15:41:15.611249000 +0000 /gfs0/brick2/glusterfs-heal
-rwxr-xr-x 2 bstansell root 468 2013-03-07 15:41:15.611249000 +0000 /gfs0/brick3/glusterfs-heal
Any time it's restoring the file to brick1, it updates the timestamp.

Here's a fun one... simulating brick1 going offline briefly.
1. kill the glusterfs process for brick1
2. update the file in /gv3
US-CA1 host13 (root) /:$ cp -p /etc/init.d/glusterfsd /gv3/glusterfs-heal
3. see the discrepancy:
US-CA1 host13 (root) /:$ ls -la --full-time /gfs0/brick?/g*
-rwxr-xr-x 2 root root 468 2013-03-07 15:49:19.845462000 +0000 /gfs0/brick1/glusterfs-heal
-rwxr-xr-x 2 root root 2019 2013-02-08 13:32:41.000000000 +0000 /gfs0/brick2/glusterfs-heal
-rwxr-xr-x 2 root root 2019 2013-02-08 13:32:41.000000000 +0000 /gfs0/brick3/glusterfs-heal
4. restart glusterd to bring brick1 back online
US-CA1 host13 (root) /:$ service glusterd stop
Stopping glusterd: [ OK ]
US-CA1 host13 (root) /:$ service glusterfsd stop
Stopping glusterfsd: [ OK ]
US-CA1 host13 (root) /:$ service glusterd start
Starting glusterd: [ OK ]
5. check out self-heal results:
US-CA1 host13 (root) /:$ ls -la --full-time /gfs0/brick?/g*
-rwxr-xr-x 2 root root 2019 2013-03-07 15:53:39.102271000 +0000 /gfs0/brick1/glusterfs-heal
-rwxr-xr-x 2 root root 2019 2013-03-07 15:53:39.102271000 +0000 /gfs0/brick2/glusterfs-heal
-rwxr-xr-x 2 root root 2019 2013-03-07 15:53:39.102271000 +0000 /gfs0/brick3/glusterfs-heal
It used the newly updated data, but updated all of the timestamps.
Todd

I'm marking this as urgent, since it involves corruption of metadata.
Todd

I've been digging through trace logs and found the following. Not sure if this is the issue or not.
The logs show metadata+data self-healing twice:
[2013-03-07 17:00:46.972953] D [afr-common.c:1211:afr_lookup_set_self_heal_params_by_xattr] 0-gv3-replicate-0: metadata self-heal is pending for <gfid:78a027b5-dae3-4851-904c-926034438fb4>.
[2013-03-07 17:00:46.972958] D [afr-common.c:1225:afr_lookup_set_self_heal_params_by_xattr] 0-gv3-replicate-0: data self-heal is pending for <gfid:78a027b5-dae3-4851-904c-926034438fb4>.
[2013-03-07 17:00:46.972963] D [afr-common.c:1211:afr_lookup_set_self_heal_params_by_xattr] 0-gv3-replicate-0: metadata self-heal is pending for <gfid:78a027b5-dae3-4851-904c-926034438fb4>.
[2013-03-07 17:00:46.972969] D [afr-common.c:1225:afr_lookup_set_self_heal_params_by_xattr] 0-gv3-replicate-0: data self-heal is pending for <gfid:78a027b5-dae3-4851-904c-926034438fb4>.
which is fine ... it then proceeds to do the first metadata check ...
[2013-03-07 17:00:46.973038] T [afr-self-heal-common.c:2357:afr_self_heal] 0-gv3-replicate-0: proceeding to metadata check on <gfid:78a027b5-dae3-4851-904c-926034438fb4>
[2013-03-07 17:00:46.973070] T [afr-self-heal-common.c:1027:afr_sh_missing_entries_done] 0-gv3-replicate-0: proceeding to metadata check on <gfid:78a027b5-dae3-4851-904c-926034438fb4>
but fails to attain non-blocking locks ...
[2013-03-07 17:00:46.973924] T [afr-lk-common.c:1472:afr_nonblocking_inodelk_cbk] 0-gv3-replicate-0: Last inode locking reply received
[2013-03-07 17:00:46.973930] T [afr-lk-common.c:1486:afr_nonblocking_inodelk_cbk] 0-gv3-replicate-0: 0 servers locked. Trying again with blocking calls
[2013-03-07 17:00:46.973937] D [afr-lk-common.c:456:transaction_lk_op] 0-gv3-replicate-0: lk op is for a self heal
[2013-03-07 17:00:46.973944] T [afr-lk-common.c:716:afr_unlock_inodelk] 0-gv3-replicate-0: No internal locks unlocked
[2013-03-07 17:00:46.973950] D [afr-self-heal-metadata.c:478:afr_sh_metadata_post_nonblocking_inodelk_cbk] 0-gv3-replicate-0: Non Blocking metadata inodelks failed for <gfid:78a027b5-dae3-4851-904c-926034438fb4>.
[2013-03-07 17:00:46.973956] D [afr-self-heal-metadata.c:480:afr_sh_metadata_post_nonblocking_inodelk_cbk] 0-gv3-replicate-0: Metadata self-heal failed for <gfid:78a027b5-dae3-4851-904c-926034438fb4>.
but instead of retrying with blocking locks, like it says, it simply moves on to the data heal:
[2013-03-07 17:00:46.973961] D [afr-self-heal-metadata.c:61:afr_sh_metadata_done] 0-gv3-replicate-0: proceeding to data check on <gfid:78a027b5-dae3-4851-904c-926034438fb4>
though later on it does try to sync metadata again:
[2013-03-07 17:00:46.974540] T [afr-self-heal-metadata.c:355:afr_sh_metadata_sync_prepare] 0-gv3-replicate-0: syncing metadata of <gfid:78a027b5-dae3-4851-904c-926034438fb4> from subvolume gv3-client-1 to 1 active sinks
...
[2013-03-07 17:00:46.975364] D [afr-self-heal-metadata.c:267:afr_sh_metadata_sync] 0-gv3-replicate-0: self-healing metadata of <gfid:78a027b5-dae3-4851-904c-926034438fb4> from gv3-client-1 to gv3-client-0
but then see the same failure to get non-blocking locks:
[2013-03-07 17:00:46.975849] D [afr-self-heal-data.c:1134:afr_sh_data_post_nonblocking_inodelk_cbk] 0-gv3-replicate-0: Non Blocking data inodelks failed for <gfid:78a027b5-dae3-4851-904c-926034438fb4>. by bc60dc7ce87f0000
Interestingly, the pending_matrix shows -2 for the 2 bricks where the data was being synced from (not that I understand the meaning of this):
[2013-03-07 17:00:46.976090] D [afr-self-heal-common.c:2510:afr_sh_erase_pending] 0-gv3-replicate-0: Delta matrix for: 7089dc7ce87f0000
[2013-03-07 17:00:46.976107] D [afr-self-heal-common.c:138:afr_sh_print_pending_matrix] 0-gv3-replicate-0: pending_matrix: [ 0 0 0 ]
[2013-03-07 17:00:46.976121] D [afr-self-heal-common.c:138:afr_sh_print_pending_matrix] 0-gv3-replicate-0: pending_matrix: [ -2 0 0 ]
[2013-03-07 17:00:46.976134] D [afr-self-heal-common.c:138:afr_sh_print_pending_matrix] 0-gv3-replicate-0: pending_matrix: [ -2 0 0 ]
and then proceeds to data check ...
[2013-03-07 17:00:46.977685] D [afr-self-heal-metadata.c:61:afr_sh_metadata_done] 0-gv3-replicate-0: proceeding to data check on <gfid:78a027b5-dae3-4851-904c-926034438fb4>
I'll attach the full log in a bit.
Todd

Created attachment 706770 [details]
self-heal trace log
Here's the self-heal log with trace enabled that I referenced in the previous comment.
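(For anyone wanting to reproduce logs like these: trace-level logging can normally be enabled per volume with the diagnostics log-level options, for example as shown below. This is a hint added for context, not something stated in the original comments.)

gluster volume set gv3 diagnostics.client-log-level TRACE
gluster volume set gv3 diagnostics.brick-log-level TRACE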
Todd
Todd,
I tried to re-create the issue locally on my machine, but I am not able to :-(. Here are the two things I tried using the test framework we have. Please let me know what I am missing.
$CLI is gluster --mode=script
$V0 is name of the volume
$H0 is ip/hostname of the machine
$M0 is the mount point
$B0 is the directory in which bricks reside in the script.
TEST fails if the command fails
EXPECT <expected-value> <command-which-is-supposed-to-give-value>
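For readers unfamiliar with the framework, the wrappers above behave roughly like the following simplified sketch (the real definitions live in the framework's include.rc; this is an approximation, not the actual implementation):

function TEST {
    # eval so that "TEST var=value" assignments stick in the calling shell
    eval "$@" > /dev/null 2>&1 || echo "FAIL: $*"
}
function EXPECT {
    local expected=$1; shift
    local got=$(eval "$@")
    [ "$got" = "$expected" ] || echo "FAIL: expected '$expected', got '$got'"
}
function EXPECT_WITHIN {
    # retry the command once a second until it returns the expected value or we time out
    local timeout=$1 expected=$2; shift 2
    local i
    for ((i = 0; i < timeout; i++)); do
        [ "$(eval "$@")" = "$expected" ] && return 0
        sleep 1
    done
    echo "FAIL: expected '$expected' within ${timeout}s"
}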
Test 1:
#!/bin/bash
. $(dirname $0)/../include.rc
. $(dirname $0)/../volume.rc
cleanup;
TEST glusterd
TEST pidof glusterd
TEST $CLI volume create $V0 replica 2 $H0:$B0/${V0}0 $H0:$B0/${V0}1
TEST $CLI volume set $V0 cluster.self-heal-daemon off
TEST $CLI volume start $V0
TEST glusterfs --volfile-id=/$V0 --volfile-server=$H0 $M0 --attribute-timeout=0 --entry-timeout=0
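# (comment added for clarity; not part of the original script) attribute/entry
# timeouts are set to 0 so every lookup/stat from the mount reaches the bricks
# instead of being served from the FUSE cache, which is what lets the
# "find | xargs stat" below trigger lookup-based self-heal.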
#stat gives the modify/access times; we extract the timestamp for the requested type
function get_timestamp {
local f=$1
local type=$2
stat $f | grep $type | grep -v Uid | awk '{print $3}' | cut -f1 -d'.'
}
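# Note (added for clarity; not part of the original script): coreutils stat prints
# e.g. "Modify: 2013-03-07 15:41:15.611249000 +0000", so the pipeline above takes
# field 3 and strips the fractional part, returning just the time of day, e.g.
#   get_timestamp $B0/${V0}0/a Modify  ->  15:41:15
# The "grep -v Uid" matters for the Access case: it skips the
# "Access: (0644/-rw-r--r--)  Uid: ..." mode line so only the access-time line remains.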
#kill brick1
TEST kill_brick $V0 $H0 $B0/${V0}1
#file is created only in brick0; healing will happen from brick0 to brick1
echo "abc" > $M0/a
TEST $CLI volume start $V0 force
#wait till brick1 comes up
EXPECT_WITHIN 30 "1" afr_child_up_status $V0 1
#kill brick0
TEST kill_brick $V0 $H0 $B0/${V0}0
#file is created only in brick1; healing will happen from brick1 to brick0
echo "abc" > $M0/b
TEST $CLI volume start $V0 force
#wait till brick0 comes up
EXPECT_WITHIN 30 "1" afr_child_up_status $V0 0
#get time stamps before self-heal
TEST access_tstampa_0=$(get_timestamp $B0/${V0}0/a Access)
TEST modify_tstampa_0=$(get_timestamp $B0/${V0}0/a Modify)
TEST access_tstampb_1=$(get_timestamp $B0/${V0}1/b Access)
TEST modify_tstampb_1=$(get_timestamp $B0/${V0}1/b Modify)
#wait 2 seconds so we can tell whether the timestamps differ after healing
sleep 2
#trigger self-heals
find $M0 | xargs stat 1>/dev/null
#get time stamps after self-heal on the bricks they are healed to
TEST access_tstampa_1=$(get_timestamp $B0/${V0}1/a Access)
TEST modify_tstampa_1=$(get_timestamp $B0/${V0}1/a Modify)
TEST access_tstampb_0=$(get_timestamp $B0/${V0}0/b Access)
TEST modify_tstampb_0=$(get_timestamp $B0/${V0}0/b Modify)
#check if the timestamps match or not.
EXPECT $access_tstampa_0 echo $access_tstampa_1
EXPECT $modify_tstampa_0 echo $modify_tstampa_1
EXPECT $access_tstampb_0 echo $access_tstampb_1
EXPECT $modify_tstampb_0 echo $modify_tstampb_1
cleanup;
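(Not part of the comment: test scripts like these normally live under tests/ in the glusterfs source tree and are run with prove; the filename below is hypothetical.)

prove -vf tests/bugs/bug-918437.t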
Test-2:
#!/bin/bash
. $(dirname $0)/../include.rc
. $(dirname $0)/../volume.rc
cleanup;
#Simulates a drive replacement by deleting a file and its gfid link, and then
#checks if the timestamps match after self-heal
TEST glusterd
TEST pidof glusterd
TEST $CLI volume create $V0 replica 2 $H0:$B0/${V0}0 $H0:$B0/${V0}1
TEST $CLI volume set $V0 cluster.self-heal-daemon off
TEST $CLI volume start $V0
TEST glusterfs --volfile-id=/$V0 --volfile-server=$H0 $M0 --attribute-timeout=0 --entry-timeout=0
function get_timestamp {
local f=$1
local type=$2
stat $f | grep $type | grep -v Uid | awk '{print $3}' | cut -f1 -d'.'
}
echo "abc" > $M0/a
echo "abc" > $M0/b
TEST kill_brick $V0 $H0 $B0/${V0}0
#deletes both the file and its gfid link from .glusterfs (a rough sketch of this helper follows after the script)
TEST gf_rm_file_and_gfid_link $B0/${V0}0/a
TEST gf_rm_file_and_gfid_link $B0/${V0}0/b
TEST access_tstampa_1=$(get_timestamp $B0/${V0}1/a Access)
TEST modify_tstampa_1=$(get_timestamp $B0/${V0}1/a Modify)
TEST access_tstampb_1=$(get_timestamp $B0/${V0}1/b Access)
TEST modify_tstampb_1=$(get_timestamp $B0/${V0}1/b Modify)
TEST $CLI volume start $V0 force
EXPECT_WITHIN 30 "1" afr_child_up_status $V0 0
sleep 2
find $M0 | xargs stat 1>/dev/null
TEST access_tstampa_0=$(get_timestamp $B0/${V0}0/a Access)
TEST modify_tstampa_0=$(get_timestamp $B0/${V0}0/a Modify)
TEST access_tstampb_0=$(get_timestamp $B0/${V0}0/b Access)
TEST modify_tstampb_0=$(get_timestamp $B0/${V0}0/b Modify)
EXPECT $access_tstampa_0 echo $access_tstampa_1
EXPECT $modify_tstampa_0 echo $modify_tstampa_1
EXPECT $access_tstampb_0 echo $access_tstampb_1
EXPECT $modify_tstampb_0 echo $modify_tstampb_1
cleanup;
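For context, gf_rm_file_and_gfid_link used in Test 2 above comes from the framework's volume.rc and removes a file from a brick together with its hard link under .glusterfs, which is the same thing Todd did by hand earlier. A manual equivalent looks roughly like the sketch below, assuming the standard .glusterfs/<xx>/<yy>/<gfid> layout and example paths taken from the earlier comments; it is not the framework's exact code:

brick=/gfs0/brick1            # brick root (example)
f=$brick/bin/sync-data        # file to wipe from this brick
# read the gfid xattr, e.g. trusted.gfid=0xabe895ece50241d28502952de49693e1
gfid_hex=$(getfattr -n trusted.gfid -e hex "$f" 2>/dev/null | grep trusted.gfid | cut -d= -f2 | cut -c3-)
# turn the 32 hex chars into the dashed UUID form used for the .glusterfs link name
gfid=$(echo "$gfid_hex" | sed -r 's/^(.{8})(.{4})(.{4})(.{4})(.{12})$/\1-\2-\3-\4-\5/')
rm -f "$f" "$brick/.glusterfs/${gfid:0:2}/${gfid:2:2}/$gfid"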
Here's the 2-brick test case:
US-CA1 host13 (root) /var/log/glusterfs:$ mkdir /gfs0/brick0{1,2}
US-CA1 host13 (root) /var/log/glusterfs:$ gluster volume create gv01 replica 2 transport tcp host13:/gfs0/brick01 host13:/gfs0/brick02
Multiple bricks of a replicate volume are present on the same server. This setup is not optimal.
Do you still want to continue creating the volume? (y/n) y
volume create: gv01: success: please start the volume to access data
US-CA1 host13 (root) /var/log/glusterfs:$ gluster volume set gv01 nfs.disable on
volume set: success
US-CA1 host13 (root) /var/log/glusterfs:$ gluster volume start gv01
volume start: gv01: success
US-CA1 host13 (root) /var/log/glusterfs:$ mount -t glusterfs host13:/gv01 /gv01
US-CA1 host13 (root) /var/log/glusterfs:$ cp -p /etc/init.d/glusterd /gv01
US-CA1 host13 (root) /var/log/glusterfs:$ ls -la --full-time /gfs0/brick0?/g*
-rwxr-xr-x 2 root root 1979 2013-02-08 13:32:41.000000000 +0000 /gfs0/brick01/glusterd
-rwxr-xr-x 2 root root 1979 2013-02-08 13:32:41.000000000 +0000 /gfs0/brick02/glusterd
US-CA1 host13 (root) /var/log/glusterfs:$ ps -ef | grep brick01
root 4926 1 0 04:48 ? 00:00:00 /usr/sbin/glusterfsd -s localhost --volfile-id gv01.host13.gfs0-brick01 -p /var/lib/glusterd/vols/gv01/run/host13-gfs0-brick01.pid -S /var/run/a7c26a6374c851d7baf4c925dfd87d0f.socket --brick-name /gfs0/brick01 -l /var/log/glusterfs/bricks/gfs0-brick01.log --xlator-option *-posix.glusterd-uuid=860e1ecd-78c7-4a99-a642-e8eac0919a56 --brick-port 49159 --xlator-option gv01-server.listen-port=49159
root 5012 7522 0 04:48 pts/0 00:00:00 grep brick01
US-CA1 host13 (root) /var/log/glusterfs:$ kill 4926
US-CA1 host13 (root) /var/log/glusterfs:$ rm -f /gfs0/brick01/glusterd /gfs0/brick01/.glusterfs/
00/ b5/ indices/ landfill/
US-CA1 host13 (root) /var/log/glusterfs:$ rm -f /gfs0/brick01/glusterd /gfs0/brick01/.glusterfs/b5/5b/b55b1a9e-0ebd-4d8a-8837-3f5d1c9977be
US-CA1 host13 (root) /var/log/glusterfs:$ ls -la --full-time /gfs0/brick0?/g*
-rwxr-xr-x 2 root root 1979 2013-02-08 13:32:41.000000000 +0000 /gfs0/brick02/glusterd
US-CA1 host13 (root) /var/log/glusterfs:$ gluster volume start gv01 force
volume start: gv01: success
US-CA1 host13 (root) /var/log/glusterfs:$ ls -la --full-time /gfs0/brick0?/g*
-rwxr-xr-x 2 root root 1979 2013-02-08 13:32:41.000000000 +0000 /gfs0/brick02/glusterd
US-CA1 host13 (root) /var/log/glusterfs:$ gluster volume heal gv01 full
Launching Heal operation on volume gv01 has been successful
Use heal info commands to check status
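(For completeness, not part of Todd's transcript: that status check would be)
gluster volume heal gv01 info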
US-CA1 host13 (root) /var/log/glusterfs:$ ls -la --full-time /gfs0/brick0?/g*
-rwxr-xr-x 2 root root 1979 2013-03-08 04:49:49.645296000 +0000 /gfs0/brick01/glusterd
-rwxr-xr-x 2 root root 1979 2013-03-08 04:49:49.645296000 +0000 /gfs0/brick02/glusterd
Todd
CHANGE: http://review.gluster.org/4658 (cluster/afr: Preserve mtime in self-heal) merged in master by Vijay Bellur (vbellur)

Will this fix be backported to 3.3.1, or will it only be available in some future 3.4 build? I'm looking for the most stable way to be able to use this soon on CentOS 6.
Todd

Given that this is a fix for data corruption, I'm officially asking for this to get backported to at least the last stable release, 3.3.1. I'm not sure how to officially request this, but it doesn't make sense to force people to upgrade to 3.4 to get the fix.
Todd

+1. Please backport bugfixes to 3.3.

(In reply to comment #15)
> Given that this is a fix for data corruption, I'm officially asking for this
> to get backported to at least the last stable release, 3.3.1. I'm not sure
> how to officially request this, but it doesn't make sense to force people to
> upgrade to 3.4 to get the fix.
>
> Todd
Right now requesting backports is ad hoc, but we'll document a process for it. Keep an eye on gluster-users.

http://review.gluster.com/4663
http://review.gluster.com/4664
Those are the backport patches for 3.4 and 3.3.
Pranith

Thanks for the backports. Any idea how long it might take to get reviews for these? We're trying to use 3.3.1, but without this patch we can't feel confident, and this piece of our project is on hold. So I'm just curious how long I should expect the backport review to take so I can prepare.
I know I could build it myself, but I'd much rather have a package that's built using kkeithley's standard build options/environment. Therefore, I'm waiting for this to go in.
Thanks for the great turnaround on this bugfix, by the way!
Todd

COMMIT: http://review.gluster.org/4664 committed in release-3.3 by Vijay Bellur (vbellur)
------
commit a216f5f44675bfe189c318171dbc50e1c19bfc26
Author: Anand Avati <avati>
Date: Tue Apr 9 17:22:01 2013 -0700

tests: fix further issues with bug-874498.t

The failure of bug-874498.t seems to be a "bug" in glustershd. The situation seems to be when both subvolumes of a replica are "local" to glustershd, and in such cases glustershd is sensitive to the order in which the subvols come up.

The core of the issue itself is that, without the patch (#4784), self-heal daemon completes the processing of index and no entries are left inside the xattrop index after a few seconds of volume start force. However with the patch, the stale "backing file" (against which index performs link()) is left. The likely reason is that an "INDEX" based crawl is not happening against the subvol when this patch is applied.

Before #4784 patch, the order in which subvols came up was:

[2013-04-09 22:55:35.117679] I [client-handshake.c:1456:client_setvolume_cbk] 0-patchy-client-0: Connected to 10.3.129.13:49156, attached to remote volume '/d/backends/brick1'.
...
[2013-04-09 22:55:35.118399] I [client-handshake.c:1456:client_setvolume_cbk] 0-patchy-client-1: Connected to 10.3.129.13:49157, attached to remote volume '/d/backends/brick2'.

However, with the patch, the order is reversed:

[2013-04-09 22:53:34.945370] I [client-handshake.c:1456:client_setvolume_cbk] 0-patchy-client-1: Connected to 10.3.129.13:49153, attached to remote volume '/d/backends/brick2'.
...
[2013-04-09 22:53:34.950966] I [client-handshake.c:1456:client_setvolume_cbk] 0-patchy-client-0: Connected to 10.3.129.13:49152, attached to remote volume '/d/backends/brick1'.

The index in brick2 has the list of files/gfid to heal.
It appears to be the case that when brick1 is the first subvol to be detected as coming up, somehow an INDEX based crawl is clearing all the index entries in brick2, but if brick2 comes up as the first subvol, then the backing file is left stale.

Also, doing a "gluster volume heal full" seems to leave out stale backing files too. As the crawl is performed on the namespace and the backing file is never encountered there to get cleared out.

So the interim (possibly permanent) fix is to have the script issue a regular self-heal command (and not a "full" one).

The failure of the script itself is non-critical. The data files are all healed, and it is just the backing file which is left behind. The stale backing file too gets cleared in the next index based healing, either triggered manually or after 10mins.

Change-Id: I5deb79652ef449b7e88684311e804a8a2aa4725d
BUG: 874498
Signed-off-by: Anand Avati <avati>
Reviewed-on: http://review.gluster.org/4798
Tested-by: Gluster Build System <jenkins.com>
Reviewed-by: Jeff Darcy <jdarcy>

COMMIT: http://review.gluster.org/4663 committed in release-3.4 by Anand Avati (avati)
------
commit 77a02c4dd8467e2a78b3ab3cdef95178ef4b1898
Author: Kaleb S. KEITHLEY <kkeithle>
Date: Thu Apr 11 09:36:35 2013 -0400

object-storage: rebase Swift to 1.8.0 (grizzly)

Two minor tweaks found while packaging 3.4.0-0.1.alpha2 for Fedora 19

BUG: 948039
Change-Id: I97175636164702cf4042bc4a18ffead76ad386cb
Signed-off-by: Kaleb S. KEITHLEY <kkeithle>
Reviewed-on: http://review.gluster.org/4807
Reviewed-by: Jeff Darcy <jdarcy>
Tested-by: Gluster Build System <jenkins.com>