| Summary: | Getting zero-byte files in a replicated folder | | |
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | bernard grymonpon <bernard> |
| Component: | replicate | Assignee: | Pranith Kumar K <pkarampu> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | |
| Severity: | medium | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 3.2.0 | CC: | gluster-bugs, jdarcy, mohitanchlia, sconstro, vbellur, vijay |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2012-12-11 05:05:42 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | DP | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Attachments: | | | |
Description (bernard grymonpon, 2011-05-20 22:09:24 UTC)

Comment 1 (Pranith Kumar K):
hi Bernard,
Could you please attach the client log file of the client which is giving problems, and tell me the timestamp from which the problem started. Are you saying that the client which is giving problems is 3.2, and all the rest of them are 3.1.4?
I will also need the volume info output for that volume, to map between the logs and the actual bricks.
What are the xattrs of the directory which says entry self-heal is pending for all the replicas? Please provide the output of getfattr -d -m "trusted*" <filename> on all the bricks.
Were you doing the upgrade at 2011-05-18 13:38? When the upgrade is done the bricks are restarted, so the inodelks could have failed at that time.
Pranith.
Bernard Grymonpon (in reply to comment #1):
The logs and configs are attached in the zip. For the other questions:
- The upgrade was done a couple of hours before; the entry is from one of the last times we've seen it (just picked a random one). I don't know the exact timestamp of the start of the problem, but it should be somewhere in the morning of May 18th.
- We had these problems before (in the 3.1.x release, in March and April), had to shut down the second brick completely, and eventually we started off with an empty brick2 after the upgrade to 3.2. I first let it settle a bit with only one client allowed access to the second brick, and did a "find | stat" on the mounted volume during the night. It replicated all the data nicely (I only did the "find" on certain parts, the critical data; the folder that is giving trouble is not among this "critical data").
- The clients are a mix of 3.2 and 3.1.2; the problematic client was first 3.1.4, and later we upgraded it to 3.2 (we hoped this would solve our problem).
- For a while, we firewalled the problematic client from brick2, as this seemed to solve our problems temporarily. In the end, only shutting the brick down completely helped.

A couple of words on the two logs:
- I've replaced some pathnames with obfuscated paths, but I kept it consistent. I'll be able to map that back to the real paths with some search/replacing; I have to ensure my client's privacy.
- You'll find two logs. The system normally is an Ubuntu Feisty installation (2007) which is running the last bits of the project (which is being upgraded). All but one of the other clients are Debian Lenny. In a desperate attempt (we suspected libc6 or some other global lib), we deployed a new client (Lenny), redeployed the client software over there, and mounted the gluster there. Exactly the same happened! There is "feisty....log" and "lenny....log". So the lenny....log is actually a clean setup, and it starts failing right away.
- The real trouble started somewhere around noon (12:00:00) on May 18th and went on the whole afternoon (we remounted, killed, firewalled, unfirewalled... a lot).
- You'll notice in the feisty....log that it goes a way back, and that we had this before (and other problems). We had to stop brick2 to fix it. It kind of makes me wonder if it isn't the combination of the ancient Java 1.5 based software and gluster that breaks. We have one other client running Feisty with no problem at all, and even the switch to a recent, decent install (Lenny) didn't fix the problems. You'll find some crazy stuff in the logfile, sorry for that, but we tried a lot of other things. All the "transport endpoint not connected" entries are us re-applying the firewalling to the second node (so we only depend on the first node), as this seemed to stabilize it a bit, but not completely.
And for the dirinfo - where can I get the getfattr command?

Created attachment 494

Created attachment 495
This is the log of the freshly installed machine, with the Java 1.5 application deployed to it; it starts failing right away. And the bigger logfile is here: http://dl.dropbox.com/u/20813/feisty-dist-new_storage.log.bz2

Comment:
FWIW, I tried to reproduce what seemed like the pertinent aspects of this, forcing directory self-heal etc., against 3.2git (but a fresh install, not upgraded) as of Friday. I was unable to reproduce the zero-length-file problem.

Comment:
I've seen other people report the same thing. Not sure if this is 3.2 related, has been there before, or is really because of some upgrade inconsistency. It would be good to get to the bottom of this issue. From what I understand, remounting the client solves the issue and then it appears again after a few hours.

Bernard Grymonpon:
Actually, as told in my comments, we've also seen this happen on previous versions (3.1.x). The logfile still contains a chunk of that event as well. We had to shut the second brick down then too, to regain access and a workable situation.
More info on the actual problem/setup: we started seeing inaccessibility on other client nodes as of yesterday. No zero-byte files have been seen, but I haven't been able to verify this, as I don't know exactly which folders are being accessed (these clients have a very random access pattern, so I can't predict where it would get stuck). Killing glusterfs and remounting on the client is the only way to get back to a working state. We have it on one specific client, while the others are performing just fine. The logs are full of "pending self-heal" (which is crazy, as the second brick is completely offline; it should not try to self-heal, as it has nothing to compare to).

Pranith Kumar K:
hi,
In the logfile of the problematic client, I see the following log:
[2011-05-18 21:03:39.421619] E [socket.c:1685:socket_connect_finish] 0-export-client-0: connection to failed (Connection refused)
The client is not able to connect to the brick 192.168.1.51:/store/export, so the writes that happen through this client are only written to the other brick (192.168.1.50:/store/export). That is why there will always be pending self-heal messages in this log. When the files written by the problematic client are accessed from other clients, they trigger the self-heals (pending-self-heal logs will appear again); since those clients are connected successfully to the brick 192.168.1.51:/store/export, the zero-size files are healed properly. I think this is the reason for the intermittent zero-size files. Could you please fix the connectivity problem.
Pranith.
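As a quick way to confirm the connectivity problem pointed at above, here is a minimal check sketch. The host, the default brick port 24009 and the log location are assumptions on my part (the port shows up in logs later in this report); adjust them to your setup:

```bash
#!/bin/bash
# Hypothetical quick check from the problematic client: can we reach the
# brick that the client log says is refusing connections?
# BRICK_HOST/BRICK_PORT and the log path are assumptions, not from the report.
BRICK_HOST=192.168.1.51
BRICK_PORT=24009

# Show the most recent connection failures the client log complains about.
grep "socket_connect_finish" /var/log/glusterfs/*.log | tail -n 5

# Test the TCP path itself; a refusal here means the brick process (or a
# firewall rule) is the problem, not replicate.
if nc -z -w 5 "$BRICK_HOST" "$BRICK_PORT"; then
    echo "brick $BRICK_HOST:$BRICK_PORT is reachable"
else
    echo "cannot reach brick $BRICK_HOST:$BRICK_PORT"
fi
```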
Pranith Kumar K:
I looked at the logs without looking at your comment about stopping brick2. Is brick2 192.168.1.51:/store/export? The pending self-heal log will always come until it is self-healed properly. The client does not need to contact the other server to decide on self-heal; the xattrs of the file on the brick that is up tell the client whether a self-heal is needed.
For getfattr, install attr using "sudo apt-get install attr".
I was not sure what you meant by the "lenny client failing right away". If it is just the pending-self-heal logs, you can safely ignore them in this scenario, as the other brick is not up and self-heals are not happening. Maybe we should consider decreasing the log level for that message to DEBUG.
Pranith.

Bernard Grymonpon:
"brick2" is indeed the 192.168.1.51 machine. Note however that gluster swapped the names (export-0 is .51 and export-1 is .50).
That machine was indeed first partially firewalled from the clients (but not from the other server), and eventually glusterd and glusterfsd were stopped on that machine.
When saying "Lenny machine failing right away" I mean that when I bring the application online on that client, the same symptoms are seen (zero-byte files visible, locking of processes).
Below you can find the attributes you requested. Note however that these files have already changed on the online client, as they are rewritten sometimes. I took 3 files; if you need more, let me know.

Good machine (the one currently online, brick 1, the .50 machine):
# file: dyna_menuF_Ver_FR.html
trusted.afr.export-client-0=0sAAAAAQAAAAAAAAAA
trusted.afr.export-client-1=0sAAAAAAAAAAAAAAAA
trusted.gfid=0sXNTho9F8Qbilj74frubVvQ==
# file: dyna_menuF_Ver_NL.html
trusted.afr.export-client-0=0sAAAAAQAAAAAAAAAA
trusted.afr.export-client-1=0sAAAAAAAAAAAAAAAA
trusted.gfid=0sk3MNqJfjQqW/5DSgYne2ww==
# file: dyna_menu_Hor_FR.html
trusted.afr.export-client-0=0sAAAAGwAAAAAAAAAA
trusted.afr.export-client-1=0sAAAAAAAAAAAAAAAA
trusted.gfid=0sher8EbBDS2e0DnuYw8JPUQ==
# file: dyna_menu_Hor_NL.html
trusted.afr.export-client-0=0sAAAAGgAAAAAAAAAA
trusted.afr.export-client-1=0sAAAAAAAAAAAAAAAA
trusted.gfid=0s2xKBrcRRRRuPsUbNnZGTQA==

Bad machine (currently offlined, brick 2, the .51 machine):
# file: dyna_menuF_Ver_FR.html
trusted.afr.export-client-0=0sAAAAAAAAAAAAAAAA
trusted.afr.export-client-1=0sAAAAAAAAAAAAAAAA
trusted.gfid=0sQiYoJnYCSDiblQgs9PGwyA==
# file: dyna_menuF_Ver_NL.html
trusted.afr.export-client-0=0sAAAAAAAAAAAAAAAA
trusted.afr.export-client-1=0sAAAAAAAAAAAAAAAA
trusted.gfid=0sbhCETMn5StuGGYgd/4RxQQ==
# file: dyna_menu_Hor_FR.html
trusted.afr.export-client-0=0sAAAAAAAAAAAAAAAA
trusted.afr.export-client-1=0sAAAAAAAAAAAAAAAA
trusted.gfid=0s0k0h+fneQrC3yTAO9L9yXQ==
# file: dyna_menu_Hor_NL.html
trusted.afr.export-client-0=0sAAAAAAAAAAAAAAAA
trusted.afr.export-client-1=0sAAAAAAAAAAAAAAAA
trusted.gfid=0s6AsxqsNbTpyUqC2sA5VZ4Q==

Comment 12 (Pranith Kumar K):
hi bernard,
I see the files have different gfids on each of the bricks. Example, for dyna_menuF_Ver_FR.html:
On good brick: trusted.gfid=0sXNTho9F8Qbilj74frubVvQ==
On bad brick: trusted.gfid=0sQiYoJnYCSDiblQgs9PGwyA==
This could be a split-brain. We need to find out whether the directory which contains these files has pending entry xattrs. Just give me the output of getfattr -d -m . -e hex <parent-directory> and I will tell you if it has pending entry xattrs.
Are you sure the permissions of these files and of the directory containing them are the same on both replicas?
My guess for the empty files in the ls output: for some reason the writes to the bad brick are not happening, so the files there are always empty. By default ls lists files from the first brick that is running in the replica pair, which in your case is the bad brick. We need to find out why writes are not happening on the bad brick to find the root cause.

Comment 13 (in reply to comment #12):
Also, why would "ls" result in displaying two files with the same name? How does it really work, and why does it fail in some cases?

Comment 14 (Bernard Grymonpon, in reply to comment #12):
Good (currently active) one (export-1):
# file: myfolders
trusted.afr.export-client-0=0x0000000000000000000065ac
trusted.afr.export-client-1=0x000000000000000000000000
trusted.gfid=0xa2287c869330432297001de77feb7766

Bad (currently disabled) one (export-0):
# file: myfolders
trusted.afr.export-client-0=0x000000000000000000000000
trusted.afr.export-client-1=0x000000000000000000000000
trusted.gfid=0xa2287c869330432297001de77feb7766

As for the permissions being the same on both replicas: not any more, as I've stopped gluster on the second brick (export-0).
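For readers trying to interpret the values above: as far as I understand the AFR changelog format in this release, each trusted.afr.<volume>-client-N value is three big-endian 32-bit counters (pending data, metadata and entry operations that this brick is accusing the named brick of missing). The layout is my assumption, not something stated in this report; a minimal decode sketch using the directory value from comment #14:

```bash
#!/bin/bash
# Decode a trusted.afr.* changelog value as printed by `getfattr -e hex`.
# Assumed layout (12 bytes): data-pending | metadata-pending | entry-pending,
# each a big-endian uint32. The sample value is the good brick's directory
# xattr quoted in comment #14 above.
val=0x0000000000000000000065ac
hex=${val#0x}

data=$((16#${hex:0:8}))     # pending data (file content) self-heals
meta=$((16#${hex:8:8}))     # pending metadata self-heals
entry=$((16#${hex:16:8}))   # pending entry (create/unlink) self-heals

echo "data=$data metadata=$meta entry=$entry"
# Prints: data=0 metadata=0 entry=26028
```

Read this way, the good brick's trusted.afr.export-client-0 on the directory holds a large number of pending entry operations against the other brick, which lines up with the constant "entry self-heal pending" messages; the non-zero base64 values on the files (0sAAAAAQAAAAAAAAAA decodes to data=1, if the layout above is right), together with the differing trusted.gfid values, are what point at the possible split-brain Pranith describes.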
Bernard Grymonpon (in reply to comment #13):
Where do you see the same file twice? The names are similar, but there is a difference in the language part (FR and NL).

Comment 16 (Pranith Kumar K, in reply to comment #14):
Bernard,
The reason you are facing problems is that replicate performs reads from the first brick that is up, and in your setup the first brick in the replica pair is the bad one, i.e. the one that needs to be self-healed. We can fix it. Perform this task when the activity on the mount point is low, because self-heal can cause a performance hit depending on the size of the files that need to be self-healed.
First set the preferred read-subvolume to be the good brick using the following command:
1) # gluster volume set export cluster.read-subvolume export-client-1
This makes replicate perform reads from the currently active brick (the one we were referring to as the "good brick"). ls won't show 0-size files after you do this, even if the bad brick is up.
2) Empty the bad brick.
3) Bring the bad brick up using the command:
# gluster volume start export force
4) From one of the clients, trigger self-heal using:
# find <gluster-mount> -print0 | xargs --null stat >/dev/null
Let me know if you face any problems.
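The four steps above condense into a short script. This is only a sketch of the procedure as I read it, not a tested recovery tool: the volume name (export), the good subvolume (export-client-1) and the brick path are taken from this report, the mount point is a placeholder, and the brick-emptying step is destructive, so it is left commented out and must be run on the bad server only, after double-checking the path:

```bash
#!/bin/bash
set -euo pipefail

# Sketch of the recovery steps from comment #16. Placeholders/assumptions:
# volume "export", good read subvolume "export-client-1", bad brick path
# /store/export on the bad server, client mount at /mnt/export.
VOLUME=export
GOOD_SUBVOL=export-client-1
BAD_BRICK_PATH=/store/export   # on the bad server only
MOUNT_POINT=/mnt/export

# 1) Read from the good brick so ls stops showing 0-byte files.
gluster volume set "$VOLUME" cluster.read-subvolume "$GOOD_SUBVOL"

# 2) Empty the bad brick (destructive!) - to be executed on the bad server.
# rm -rf "${BAD_BRICK_PATH:?}"/*

# 3) Bring the bad brick back up.
gluster volume start "$VOLUME" force

# 4) Trigger self-heal by looking up every file through the mount.
find "$MOUNT_POINT" -print0 | xargs --null stat >/dev/null
```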
Comment 17 (in reply to comment #16):
And why would the brick be bad? Also, from what I remember, Bernard reported on IRC that after an umount and mount everything runs ok for a while, but goes bad again after a while.

Comment 18 (Bernard Grymonpon, in reply to comment #17):
Correct, but anything will do now, as my client is getting a bit restless. I'll try to fix this, and I hope the reason everything went haywire was me making a mistake somehow.
When you say "empty the bad brick" - I assume that I should just "rm -Rf" the complete content of the brick? Maybe even re-run mkfs.xfs or mkfs.ext3 on it?

Comment 19 (Pranith Kumar K, in reply to comment #18):
Sure. It is better to use the same filesystem on both bricks to have uniform performance. Just make it empty; which command you want to use is your choice.

Comment 20 (Pranith Kumar K):
hi Bernard,
Did you get a chance to perform this? What is the result?
Pranith
Comment 21 (Bernard Grymonpon, in reply to comment #20):
I did make the change. Not everything went that smooth... A summary.
First of all, I reformatted the bad brick's (export-0) storage point, started gluster, and then did a "gluster volume start export force". As expected, there was a certain slowdown on all the nodes, and my renewed node started to grow in used disk size. The logs filled up with heal messages, all very normal.
However, after a minute or two, I got stuck processes again, waiting for file access. When I visited the mountpoints and did "ls" or "cat <file>" in them, I could wait forever for a response (sometimes it came after 5 minutes, sometimes I cancelled the operation).
At this point, all of this seemed strange, as I expected the same responsiveness as with only one node online, since all read operations should go to the good node, the one capable of delivering all data to all clients while the bad brick was offlined. Guess not...
In my setup, there are two busy nodes and a lot of more dormant nodes. On the two busier nodes, I unmounted the mount (done with the gluster fuse client before) and remounted with NFS. I expected better performance due to file caching, after reading some discussions on the mailing lists and IRC channel.
The NFS mount worked fine, and access was snappy. However, I got several folders where only a couple of the files or folders were listed on the client, while they all exist on the good brick. Again, a clear indication that reading off one brick didn't really work as needed. I double-checked the log, and the NFS server actually does have the "preferred read child" in it...
I assumed (wrongly?) that the preferred read child option would force each and every read action to go to the good brick and return good data; only write operations should go to the other brick.
At that point, I got fed up with it, killed the second brick, redid the NFS mounts, and everything is up again, performing as needed. The NFS mounts are working (and I'll keep them like this for now; I have the impression that they are faster).
I tried enabling the bad brick again at one of the least loaded moments on the setup, but the scenario was the same... locked up clients, half-listed directory content, mount points becoming inaccessible... And I killed the bad brick again.
To be honest, I'm losing my faith in gluster here. My setup is very simple (two storage nodes, simple replication), my workload is not crazy high (10 clients, some Apache web stuff, some Tomcats, some Java processes), my machines are fairly up to date (Debian Lenny), and my team are skilled system engineers. If you want, I'm even willing to give you guys SSH access into my boxes to look around; just get in touch with me.

Comment 22 (in reply to comment #21):
Do you have an extra box that you can use, just to rule out any HW-related issue? Not saying that is the issue, just thinking it might reveal something.

Bernard Grymonpon (in reply to comment #22):
I'm pretty sure the hardware is perfectly okay... It is not an option to replace these machines short-term.

Created attachment 507

Created attachment 508

Pranith Kumar K:
hi Bernard,
Please apply the 2 patches attached and give them a try, and let us know if you still face any problems. We were able to complete self-heal in a setup similar to yours with these patches. They need to be applied on the release-3.2 branch.
Pranith
Comment 27 (Bernard Grymonpon):
The patches applied correctly and the recompile worked fine.
I updated and installed them on every client and on the storage nodes. After that, I first restarted everything on the good brick, then every client was remounted, and finally the second (bad) brick was started.
(All mounts are through the gluster client, no NFS.)
After a while, on one of the busier nodes, I saw stuck processes again, but I could kill them without having to kill the mount. I assume it is just too slow to read with all the healing it has to do.
Is there a way to get the initial data over to ease the healing? Can I just rsync it first from the other, good brick?
Currently, the second brick has been shut down again, and everything runs fine again.

Comment 28 (Pranith Kumar K, in reply to comment #27):
hi Bernard,
The way such a situation is handled with our customers is the following: suppose you have 2 mounts that are serving the data from the bricks. We create a 3rd mount for the sole purpose of self-heal, and on the two mounts that are serving data we disable self-heal. This makes sure that self-heals are not triggered by the mounts that are serving the data, which generally is the reason for slow reads.
I am assuming you have already done step 0:
0) gluster volume set export cluster.read-subvolume export-client-1

Here are the steps to perform. First, make sure that the existing mounts are not performing self-heals:
1) sudo gluster volume set export cluster.data-self-heal off
2) sudo gluster volume set export cluster.metadata-self-heal off
3) sudo gluster volume set export cluster.entry-self-heal off

After the above commands are executed, do the following:
4) Copy the file /etc/glusterd/vols/export/export-fuse.vol to /etc/glusterd/vols/export/export-fuse-self-heal.vol and edit the following section in export-fuse-self-heal.vol from:

volume export-replicate-0
    type cluster/replicate
    option read-subvolume export-client-1
    option metadata-self-heal off
    option data-self-heal off
    option entry-self-heal off
    subvolumes export-client-0 export-client-1
end-volume

to:

volume export-replicate-0
    type cluster/replicate
    option read-subvolume export-client-1
    subvolumes export-client-0 export-client-1
end-volume

and save the file.

5) Create a new mount with the following command. I am assuming that the mount with self-heal enabled is going to happen on /mnt/client, so do mkdir /mnt/client, and after that launch the client process:
glusterfs -f /etc/glusterd/vols/export/export-fuse-self-heal.vol /mnt/client -l /etc/glusterd/vols/export/export-fuse-self-heal.log
Check that the mount succeeded, cd into the mount and see that you find your brick contents.

6) Initiate the self-heal using the command (again assuming that /mnt/client is the mount point where self-heal happens):
find /mnt/client -print0 | xargs --null stat >/dev/null
The command above can take a while depending on the amount of data it needs to heal.

7) After the self-heal is completed, set the self-heal options back to "on", then unmount the self-heal mount, in that order.

I have tried these exact commands on my laptop with a small number of files, and it worked as expected for me. Let us know if you need any more help.
Thanks
Pranith.
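Since the procedure above has several moving parts, here is a rough consolidation into one script. It is only a sketch of comment #28's steps as I read them, reusing the same assumed names (volume export, volfiles under /etc/glusterd/vols/export, self-heal mount at /mnt/client); the volfile edit is done with sed and should be checked by hand before mounting:

```bash
#!/bin/bash
set -euo pipefail

# Sketch of the dedicated self-heal mount procedure from comment #28.
# Assumed names: volume "export", volfiles under /etc/glusterd/vols/export,
# temporary self-heal mount at /mnt/client.
VOLUME=export
VOLDIR=/etc/glusterd/vols/$VOLUME
HEAL_VOL=$VOLDIR/export-fuse-self-heal.vol
HEAL_MNT=/mnt/client

# 1-3) Stop the regular mounts from doing self-heal themselves.
for opt in data-self-heal metadata-self-heal entry-self-heal; do
    gluster volume set "$VOLUME" cluster.$opt off
done

# 4) Build a client volfile that still has self-heal enabled by dropping the
#    three "option ...-self-heal off" lines from a copy of the fuse volfile.
cp "$VOLDIR/export-fuse.vol" "$HEAL_VOL"
sed -i -e '/option data-self-heal off/d' \
       -e '/option metadata-self-heal off/d' \
       -e '/option entry-self-heal off/d' "$HEAL_VOL"

# 5) Mount it on a path used only for healing.
mkdir -p "$HEAL_MNT"
glusterfs -f "$HEAL_VOL" "$HEAL_MNT" -l "$VOLDIR/export-fuse-self-heal.log"

# 6) Walk the whole tree once so every entry gets looked up (and healed).
find "$HEAL_MNT" -print0 | xargs --null stat >/dev/null

# 7) Put the options back and drop the heal-only mount.
for opt in data-self-heal metadata-self-heal entry-self-heal; do
    gluster volume set "$VOLUME" cluster.$opt on
done
umount "$HEAL_MNT"
```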
Pranith Kumar K (in reply to comment #28):
hi Bernard,
Did you get a chance to try this out? What is the result?
Pranith

Comment 30 (Bernard Grymonpon):
Still not solved. All commands went well; I saw (before doing the find | stat) that clients with self-heal disabled triggered the creation of empty files on the second storage...
Then, during the night, I started the self-heal setup (which also worked well), started the find/stat on some small folders, and replication seemed to heal right. Then I started the self-heal on all folders.
During the night, no errors were reported, but once the load and the number of requests started to go up, we got reading errors again (empty files where there weren't any), several clients not being able to read files (locking up), ... After I killed the bad brick again, everything released and all goes well again.
I've played a bit with stat-prefetch, some caching, ... but nothing helped.
I'm puzzled at what is so special about my setup that it can't seem to get it right. Would it be better to kick out all the glusterfs clients and switch to NFS (I lose the seamless failover then, but hey...), so there is actually only one "client" (the NFS server) instead of a multitude of clients on the gluster system?
One thing I discovered is that my client has one folder on this system with over 35000 files in it (a single folder). Doing actions in that folder tends to lock up for a couple of minutes. Is gluster not able to cope with this number of files?

Comment 31 (Alex):
1) I have 2 machines running glusterd, 10.200.200.72 and 10.200.200.78; I create a volume "pgdata" replicated on both machines
2) I mount the gluster volume on 10.200.200.72 (mount.glusterfs 10.200.200.72:/pgdata /mnt/glusterfs)
3) I do echo "1">/mnt/glusterfs/test1
4) The test1 file is propagated to 10.200.200.72 and 10.200.200.78
5) On 10.200.200.78 I cut the connection: iptables -A INPUT -s 10.200.200.72 -j REJECT
6) I do echo "1">/mnt/glusterfs/test2
7) On 10.200.200.78 I reestablish the connection: iptables -F INPUT
8) I do echo "1">/mnt/glusterfs/test3

Problems:
A) The file test3 is propagated to 10.200.200.72 and 10.200.200.78, but the file test2 has zero size on 10.200.200.78 and is ok on 10.200.200.72
B) If I do not execute step 8, the file test2 does not appear on 10.200.200.78 at all
C) A weird thing: I run gluster 3.2 and 3.3, but "GlusterFS-3.1.0" appears in the client logs. I have never installed 3.1.

Logs:

******************** client (10.200.200.72) ********************
[2011-07-22 06:54:38.721993] I [client.c:1883:client_rpc_notify] 0-pgdata-client-0: disconnected
[2011-07-22 06:54:49.174834] E [socket.c:1685:socket_connect_finish] 0-pgdata-client-0: connection to 10.200.200.78:24009 failed (Connection refused)
[2011-07-22 06:54:55.176017] I [client-handshake.c:1082:select_server_supported_programs] 0-pgdata-client-0: Using Program GlusterFS-3.1.0, Num (1298437), Version (310)
[2011-07-22 06:54:55.176435] I [client-handshake.c:913:client_setvolume_cbk] 0-pgdata-client-0: Connected to 10.200.200.78:24009, attached to remote volume '/mnt/export1'.

******************** bricks (10.200.200.72) ********************
[2011-07-22 06:54:32.322301] W [socket.c:204:__socket_rwv] 0-tcp.pgdata-server: readv failed (Connection reset by peer)
[2011-07-22 06:54:32.322348] W [socket.c:1494:__socket_proto_state_machine] 0-tcp.pgdata-server: reading from socket failed. Error (Connection reset by peer), peer (10.200.200.78:1021)
[2011-07-22 06:54:32.322393] I [server.c:438:server_rpc_notify] 0-pgdata-server: disconnected connection from 10.200.200.78:1021
[2011-07-22 06:54:32.322408] I [server-helpers.c:783:server_connection_destroy] 0-pgdata-server: destroyed connection of iz-app5.presagia.lan-2211-2011/07/21-17:12:23:606117-pgdata-client-1
[2011-07-22 06:55:03.705375] I [server-handshake.c:534:server_setvolume] 0-pgdata-server: accepted client from 10.200.200.78:1021

******************** bricks (10.200.200.78) ********************
[2011-07-22 06:54:42.861553] W [socket.c:204:__socket_rwv] 0-tcp.pgdata-server: readv failed (Connection timed out)
[2011-07-22 06:54:42.861613] W [socket.c:1494:__socket_proto_state_machine] 0-tcp.pgdata-server: reading from socket failed. Error (Connection timed out), peer (10.200.200.72:1020)
[2011-07-22 06:54:42.861668] I [server.c:438:server_rpc_notify] 0-pgdata-server: disconnected connection from 10.200.200.72:1020
[2011-07-22 06:54:42.861701] I [server-helpers.c:783:server_connection_destroy] 0-pgdata-server: destroyed connection of prsdelphi-8971-2011/07/21-17:14:16:987973-pgdata-client-0
[2011-07-22 06:54:43.966523] W [socket.c:204:__socket_rwv] 0-tcp.pgdata-server: readv failed (Connection timed out)
[2011-07-22 06:54:43.966571] W [socket.c:1494:__socket_proto_state_machine] 0-tcp.pgdata-server: reading from socket failed. Error (Connection timed out), peer (10.200.200.72:1022)
[2011-07-22 06:54:43.966627] I [server.c:438:server_rpc_notify] 0-pgdata-server: disconnected connection from 10.200.200.72:1022
[2011-07-22 06:54:43.966658] I [server-helpers.c:783:server_connection_destroy] 0-pgdata-server: destroyed connection of prsdelphi-8832-2011/07/21-17:12:23:299761-pgdata-client-0
[2011-07-22 06:54:55.889277] I [server-handshake.c:534:server_setvolume] 0-pgdata-server: accepted client from 10.200.200.72:1020
[2011-07-22 06:55:01.188822] I [server-handshake.c:534:server_setvolume] 0-pgdata-server: accepted client from 10.200.200.72:1022
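The steps in comment #31 condense into a small script. This is only a sketch of the reproducer as described: it assumes the pgdata volume already exists and is started, that it runs as root on 10.200.200.72, that root SSH access to 10.200.200.78 is available for the iptables steps (in the original report those were run by hand on the second box), and that the brick path is /mnt/export1 as seen in the client log above:

```bash
#!/bin/bash
set -euo pipefail

# Sketch of the reproducer from comment #31. Assumptions: volume "pgdata"
# already created and started, run as root on 10.200.200.72, root SSH access
# to 10.200.200.78, brick path /mnt/export1 (taken from the client log).
PEER=10.200.200.78
MNT=/mnt/glusterfs
BRICK=/mnt/export1

mkdir -p "$MNT"
mount.glusterfs 10.200.200.72:/pgdata "$MNT"                   # step 2

echo "1" > "$MNT/test1"                                         # step 3: replicates fine

ssh root@"$PEER" iptables -A INPUT -s 10.200.200.72 -j REJECT   # step 5: cut the link
sleep 5   # give the client a moment to register the disconnect (not in the original steps)
echo "1" > "$MNT/test2"                                         # step 6: written while the peer is cut off
ssh root@"$PEER" iptables -F INPUT                              # step 7: restore the link
sleep 5   # allow reconnection (not in the original steps)
echo "1" > "$MNT/test3"                                         # step 8

# Compare what actually landed on each brick: in the report, test2 shows up
# zero-sized (or missing) on the 10.200.200.78 brick until it is self-healed.
ls -l "$BRICK"/test* || true
ssh root@"$PEER" ls -l "$BRICK"/test* || true
```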
Pranith Kumar K (in reply to comment #31):
Alex,
The file contents will be healed when you try to access the file again, so if you do ls /mnt/glusterfs/test2 the file contents will be self-healed. The GlusterFS-3.1.0 string you are seeing is the RPC program version, which has not changed since we started using it in 3.1.0. Don't worry about that.
Pranith

Pranith Kumar K (in reply to comment #30):
hi Bernard,
I did not get a chance to look into this bug after my last comment. Did you find a way to get back to a normal state?
To answer the question in your last comment: when the entries in a directory have to be self-healed, a lock is taken on that directory, and only after creating all the 35000 files in that directory is the directory unlocked, so the directory will appear to be hung for that time. I tested that creating that many files does take on the order of minutes:

pranithk @ /home/gfs3
$ date; for i in {1..35000}; do sudo touch $i; done; date
Tue Nov 22 16:01:15 IST 2011
Tue Nov 22 16:07:28 IST 2011

Pranith.

Bernard Grymonpon:
We stopped using GlusterFS. I have no further comments on this issue, as the setup has been removed and reinstalled... Feel free to close this bug.

Closing as requested information has not been provided.