Bug 835834 - glusterfs client hangs reproducibly
Status: CLOSED CURRENTRELEASE
Product: GlusterFS
Classification: Community
Component: core
3.3.0
x86_64 Linux
Priority: high, Severity: high
Assigned To: Amar Tumballi
: Reopened
Depends On:
Blocks:
Reported: 2012-06-27 04:59 EDT by Johannes Martin
Modified: 2013-12-18 19:08 EST
9 users

See Also:
Fixed In Version: glusterfs-3.4.0
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-07-24 13:51:45 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
Debug log, lockup starts at 10:59:19, glusterfs killed 10:47:00 (302.52 KB, text/x-log)
2012-06-27 04:59 EDT, Johannes Martin
strace output (910.55 KB, application/octet-stream)
2012-06-27 05:00 EDT, Johannes Martin
state dump of glusterfs mount process (54.36 KB, application/octet-stream)
2012-06-27 06:12 EDT, Johannes Martin
state dump of glusterfs mount process after making sure glusterfs hangs (40.91 KB, application/octet-stream)
2012-06-27 07:34 EDT, Johannes Martin
sample program files to reproduce error (20.70 MB, application/zip)
2012-06-28 07:15 EDT, Johannes Martin
Full output of strace -f for test case (220.42 KB, application/octet-stream)
2012-07-03 03:24 EDT, Johannes Martin
Collection of debug information (2.40 MB, application/x-bzip)
2012-07-03 05:59 EDT, Johannes Martin
stack trace of glusterfs client after lockup happens (1.88 KB, text/x-log)
2012-07-05 04:11 EDT, Johannes Martin
strace of glusterfs client (2.46 MB, application/x-bzip)
2012-07-06 09:31 EDT, Johannes Martin
option to disable the auxiliary gid support in glusterfs (1.72 KB, patch)
2012-09-18 06:07 EDT, Amar Tumballi

Description Johannes Martin 2012-06-27 04:59:37 EDT
Created attachment 594717 [details]
Debug log, lockup starts at 10:59:19, glusterfs killed 10:47:00

Description of problem:
After upgrading from glusterfs 3.1 to glusterfs 3.3.0, glusterfs hangs when a specific Java program is executed.

Version-Release number of selected component (if applicable):
glusterfs 3.3.0 built on Jun 24 2012 22:48:03
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>


How reproducible: always

Steps to Reproduce:
1. mount glusterfs volume (which is replicated over two bricks)
2. start java program
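For illustration only, step 1 as a plain command (the server and volume names here are made up, not taken from this report):

  mount -t glusterfs server1:/testvol /server/gluster

Step 2 is then running the Java program with its class path pointing at jars stored under the mount.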
  
Actual results:
glusterfs process hangs, the whole share becomes inaccessible on this client.

Expected results:
glusterfs share remains usable

Additional info:
As we have to run glusterfs on some legacy OSs, I had to compile it myself. Compilation was done on a Debian lenny system, which runs the servers. The client runs SUSE Linux Enterprise Server 11 SP1. Client binaries were obtained by running alien on the deb package compiled on the Debian lenny system.

"ps" hangs as well after glusterfs stops working. "top" helps in identifying the process number of the hung glusterfs process.

iowaits rise to 25% when glusterfs hangs.

Killing the glusterfs process and remounting the share resolves the problem (ps returns, share becomes accessible again) until the program is run again.

glusterfs debug log and strace do not show any activity while glusterfs hangs (both attached).
Comment 1 Johannes Martin 2012-06-27 05:00:17 EDT
Created attachment 594718 [details]
strace output
Comment 2 Pranith Kumar K 2012-06-27 06:03:32 EDT
hi Johannes,
    Thanks for logging the bug with glusterfs. It is easier to see the problem if a statedump of the glusterfs mount process is available.

Procedure to capture statedump:
Run 'ps aux | grep glusterfs' and note the pid of the process that corresponds to the mount process.

Run 'kill -USR1 <pid-of-mount-process>';
it creates a file at /tmp/glusterdump.<pid-of-mount-process>.

Please attach that file to this bug report.
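In one place, the whole capture boils down to (pid placeholder as above):

  ps aux | grep glusterfs            # note the pid of the mount (client) process
  kill -USR1 <pid-of-mount-process>  # ask it to write a statedump
  ls -l /tmp/glusterdump.<pid-of-mount-process>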
Comment 3 Johannes Martin 2012-06-27 06:12:39 EDT
Created attachment 594743 [details]
state dump of glusterfs mount process
Comment 4 Pranith Kumar K 2012-06-27 06:36:58 EDT
(In reply to comment #3)
> Created attachment 594743 [details]
> state dump of glusterfs mount process

The statedump needs to be captured while the mount process is hung. Are you sure you captured it when the mount process hung?

Pranith
Comment 5 Johannes Martin 2012-06-27 07:34:06 EDT
Created attachment 594747 [details]
state dump of glusterfs mount process after making sure glusterfs hangs

When taking the previous state dump, I did not actually make sure that glusterfs was hanging. The test case has so far always led to a lockup though.

I performed the test case again, issued an ls within the glusterfs mount to make sure glusterfs was hanging and obtained the attached state dump.
Comment 6 Pranith Kumar K 2012-06-27 07:58:52 EDT
hi Johannes,
   Thanks for the statedumps. There are no call_frames that indicate a hang, i.e. glusterfs responded to all the system calls that were sent to it. That's a dead end.

In the bug description you said that the lockup starts after the glusterfs process is killed. If you don't kill the glusterfs process, does it still lock up?

In the strace output I see "Resource temporarily unavailable" messages.
These generally happen for socket files. Does the program create socket files on the glusterfs mount?

Pranith
Comment 7 Pranith Kumar K 2012-06-27 08:11:35 EDT
hi Johannes,
      Could you post the output of cat /sys/fs/fuse/connections/*/waiting
Is that non-zero?

Pranith
Comment 8 Pranith Kumar K 2012-06-27 08:17:30 EDT
(In reply to comment #7)
> hi Johannes,
>       Could you post the output of cat /sys/fs/fuse/connections/*/waiting
> Is that non-zero?.
> 
> Pranith

Even this output needs to be captured when the mount process hangs.
Comment 9 Johannes Martin 2012-06-27 10:00:24 EDT
Hi Pranith, 

thank you for your help.

One clarification: the processes don't hang after I kill the glusterfs process. Killing the glusterfs process makes them go again (though they fail since the files they try to access are no longer there after glusterfs is killed).

I took an strace of the Java program that causes the lockup. Would that help you? Should I attach it to this ticket?

The last few lines in the program's strace read:
---
stat("/server/gluster/home/ulmf/Intersoft/stammdaten_tool/lib/axis.jar", {st_mode=S_IFREG|0644, st_size=1604148, ...}) = 0
stat("/server/gluster/home/ulmf/Intersoft/stammdaten_tool/lib/axis.jar", {st_mode=S_IFREG|0644, st_size=1604148, ...}) = 0
open("/server/gluster/home/ulmf/Intersoft/stammdaten_tool/lib/axis.jar", O_RDONLY) = 5
fstat(5, {st_mode=S_IFREG|0644, st_size=1604148, ...}) = 0
lseek(5, 0, SEEK_END)                   = 1604148
mmap(NULL, 1604148, PROT_READ, MAP_SHARED, 5, 0) = 0x7f720db61000
close(5)                                = 0
---

The strace continues with a SIGBUS once glusterfs is killed.

While glusterfs is still running:
# cat /sys/fs/fuse/connections/*/waiting
0
0
0
1

A while later, that "1" turned into a "2".

After killing glusterfs, output goes back to all zeroes.
Comment 10 Johannes Martin 2012-06-27 10:03:01 EDT
(In reply to comment #6)
> In the strace output I see Resource temporarily Not available messages.
> These generally happen for socket files. Does the program create socket
> files on glusterfs mount?

Not that I can tell. The strace shows only one socket call:
---
socket(PF_FILE, 0x80801 /* SOCK_??? */, 0) = 3
connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = 0
sendto(3, "\2\0\0\0\v\0\0\0\7\0\0\0passwd\0", 19, MSG_NOSIGNAL, NULL, 0) = 19
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP}], 1, 5000) = 1 ([{fd=3, revents=POLLIN|POLLHUP}])
recvmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"passwd\0", 7}, {"\270O\3\0\0\0\0\0", 8}], msg_controllen=24, {cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, {4}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG
_CLOEXEC) = 15
mmap(NULL, 217016, PROT_READ, MAP_SHARED, 4, 0) = 0x7f72246f3000
close(4)                                = 0
close(3)                                = 0
---
Comment 11 Pranith Kumar K 2012-06-28 02:19:42 EDT
(In reply to comment #9)
> Hi Pranith, 
> 
> thank you for your help.
> 
> One clarification: the processes don't hang after I kill the glusterfs
> process. Killing the glusterfs process makes them go again (though they fail
> since the files they try to access are no longer there after glusterfs is
> killed).
> 
> I took an strace of the Java program that causes the lockup. Would that help
> you? Should I attach it to this ticket?
> 
> The last few lines in the program's strace read:
> ---
> stat("/server/gluster/home/ulmf/Intersoft/stammdaten_tool/lib/axis.jar",
> {st_mode=S_IFREG|0644, st_size=1604148, ...}) = 0
> stat("/server/gluster/home/ulmf/Intersoft/stammdaten_tool/lib/axis.jar",
> {st_mode=S_IFREG|0644, st_size=1604148, ...}) = 0
> open("/server/gluster/home/ulmf/Intersoft/stammdaten_tool/lib/axis.jar",
> O_RDONLY) = 5
> fstat(5, {st_mode=S_IFREG|0644, st_size=1604148, ...}) = 0
> lseek(5, 0, SEEK_END)                   = 1604148
> mmap(NULL, 1604148, PROT_READ, MAP_SHARED, 5, 0) = 0x7f720db61000
> close(5)                                = 0
> ---
> 
> The strace continues with a SIGBUS once glusterfs is killed.
> 
> While glusterfs is still running:
> # cat /sys/fs/fuse/connections/*/waiting
> 0
> 0
> 0
> 1
> 
> A while later, that "1" turned into a "2".
> 
> After killing glusterfs, output goes back to all zeroes.

That seems like an issue:
FYI: excerpt from fuse-documentation
-------------------------
Under the fuse control filesystem each connection has a directory
named by a unique number.

For each connection the following files exist within this directory:

 'waiting'

  The number of requests which are waiting to be transferred to
  userspace or being processed by the filesystem daemon.  If there is
  no filesystem activity and 'waiting' is non-zero, then the
  filesystem is hung or deadlocked.
-------------------------
Let me gather inputs from my colleagues about how to move forward.

Could you do these two experiments meanwhile:
1) Create a volume with just one brick and mount it, run the java program on this new mount and see if it still hangs; that should isolate whether it is a replication issue or not.

2) If the hang still happens in experiment 1 above, mount the volume as 'nfs' and see if it hangs. That will isolate whether it is just a fuse-glusterfs issue or a general issue in glusterfs.

I would like to conduct the experiments myself if that java program is open-source or free software. Do let us know if that is the case.

Pranith
Comment 12 Johannes Martin 2012-06-28 02:58:20 EDT
Results of the experiments:
1) Test on a volume with just one brick fails as well (i.e. glusterfs hangs, cat /sys/fs/fuse/connections/*/waiting is not all 0).
2) Test on the same volume mounted as nfs succeeds.

Unfortunately, the program is not open source. I will see whether I can reproduce the problem with some other Java program.
Comment 13 Pranith Kumar K 2012-06-28 03:28:24 EDT
(In reply to comment #12)
> Results of the experiments:
> 1) Test on volume with just one brick fails as well (i.e. glusterfs hangs,
> cat /sys/fs/fuse/connections/*/waiting not all 0.
> 2) Test on same volume mounted as nfs succeeds.
> 
> Unfortunately, the program is not open source. I will see whether I can
> reproduce the problem with some other Java program.

So on nfs it did not hang? '/sys/fs/fuse/connections/*/waiting' should not be checked in this case. Do confirm this. I think the problem is with the fuse translator.

Until we find the root cause we will need your help and the 'java program'. I will mail devs experienced with fuse about this bug; is it ok if I cc you?
Comment 14 Johannes Martin 2012-06-28 03:47:39 EDT
Further experiments:
I wrote a minimal java test program:
public class Test {
    public static void main(String args[]) {
	System.out.println("Test");
    }
}

Program works fine by itself.

Only if I specify a long enough class path containing jars stored on the glusterfs volume do I experience the lockup.

Example:
$JAVA_HOME/bin/java -cp avalon-framework-4.1.5.jar:axis-1.4-deurag-2009-01-20.jar:axis-jaxrpc-1.4.jar:axis-saaj-1.4.jar:axis-wsdl4j-1.5.1.jar:commons-logging-1.0.4.jar:commons-discovery-0.2.jar:commons-lang-2.4.jar:commons-codec-1.3.jar:javac-1.5.0.jar:fop-0.20.5.jar:batik-1.5-fop-0.20-5.jar:xml-apis-1.0.b2.jar:xalan-2.4.1.jar:xercesImpl-2.2.1.jar:mail-1.3.3_01.jar:activation-1.0.2.jar:log4j-1.2.16.jar:ganymed-ssh2-build209.jar:jcifs-1.3.14.jar:mysql-connector-java-3.1.11.jar:ojdbc14-10.1.0.2.0.jar:jai_core-1.1.2_01.jar:jai_codec-1.1.2_01.jar:jpgpmime-0.1.jar:poi-3.7.jar:poi-ooxml-3.7.jar:poi-ooxml-schemas-3.7.jar:xmlbeans-2.3.0.jar:stax-api-1.0.1.jar:geronimo-stax-api_1.0_spec-1.0.jar:dom4j-1.6.1.jar:. Test

The failure seems to happen only with JDK 5 (1.5.0_15).

1.6.0_18 and 1.7.0_02 do not show the lockup.

The test also works fine on a Ubuntu 12.04 virtual machine.

Fuse versions are likely different on these systems:
SLES11: 
dmesg | grep fuse
[   13.088554] fuse init (API version 7.13)
fusermount -V
fusermount version: 2.7.2


Ubuntu 12.04:
dmesg | grep fuse
[    1.025554] fuse init (API version 7.17)
fusermount -V
fusermount version: 2.8.6
Comment 15 Johannes Martin 2012-06-28 03:49:35 EDT
(In reply to comment #13)
> (In reply to comment #12)
> > Results of the experiments:
> > 1) Test on volume with just one brick fails as well (i.e. glusterfs hangs,
> > cat /sys/fs/fuse/connections/*/waiting not all 0.
> > 2) Test on same volume mounted as nfs succeeds.
> > 
> > Unfortunately, the program is not open source. I will see whether I can
> > reproduce the problem with some other Java program.
> 
> So on nfs it did not hang?. '/sys/fs/fuse/connections/*/waiting' should not
> be checked in this case. Do confirm this. I think the problem is with fuse
> translator.
Yes, nfs did not hang, so it looks like the problem lies in fuse.

> Until we find the root cause we will need your help and 'java program'. I
> will mail devs experienced with fuse about this bug, is it ok if I cc you.
I could send you or attach to this report the jars listed in the command line above. They are all non-proprietary.

CCing me in the emails to the developers is ok.
Comment 16 Pranith Kumar K 2012-06-28 07:06:12 EDT
Please upload the jar files and the procedure to run them.
Comment 17 Johannes Martin 2012-06-28 07:15:37 EDT
Created attachment 594992 [details]
sample program files to reproduce error

To reproduce:
- unzip sample program on glusterfs volume
- run ./batch.sh

The program does not seem to cause a lockup every time, but only most of the time.
Comment 18 Anand Avati 2012-07-02 20:10:06 EDT
Johannes,
  The bug is not reproducible at our end. We need the "strace -f" log of the application which gets hung. You have pasted just a few lines in one of the comments above, but that is incomplete (there is no indication about which system call resumed with SIGBUS). Please post the full and entire strace -f output log for the run which hangs.
Comment 19 Johannes Martin 2012-07-03 03:24:51 EDT
Created attachment 595897 [details]
Full output of strace -f for test case
Comment 20 Johannes Martin 2012-07-03 03:28:09 EDT
(In reply to comment #19)
> Created attachment 595897 [details]
> Full output of strace -f for test case

Info: process got stuck at line 1615 in strace output 
----
[pid 16866] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 16866] futex(0x40115118, FUTEX_WAKE_PRIVATE, 1) = 0
----
Comment 21 Csaba Henk 2012-07-03 05:14:23 EDT
Johannes,

sorry for bugging you again with reiterations on the log collection process, but I have to amend Avati's request with some more requirements to make the logs useful. (Partly because of some nonsense on behalf of strace.)

I ended up writing a mini-essay...

## Fixing strace

- strace should be given an option like "-s 500", because the default string length in strace logs is ridiculously small

- you should not simply redirect the strace output to a file in the shell, but have strace write it to the file itself with "-o FILE", _because strace produces a different format in these two cases_ (OMG) (and the "-o" kind is much better).

So, your strace invocation should be something like:

$ strace -f -s500 -o glusterhang.strace.log ./batch.sh

## Also fusedump

The other thing I ask of you is to get a fuse dump from your glusterfs client. That can be done as follows:

- glusterfs has to be invoked directly, not through the mount wrapper. If you already happen to do that, fine; else do a mount as usual, then identify the exact argument list that's used with glusterfs through ps or similar (my favorite is "pgrep glusterfs | xargs -r ps ww").

- for the test run, restart the glusterfs client with the standard arguments that you identified above, plus the following additional argument: "--dump-fuse=/tmp/glusterhang.fdump" (or use another path, just make sure it's absolute).

- with all these in place, redo the test under strace.

## Process stack

One more request. If your system has stack support in procfs (you can check this by seeing whether you can cat /proc/self/stack), then when the hang sets in, please send the stacks of the threads of the hung process; the best method for that is:

$ head -100  /proc/<pid>/task/*/stack > glusterhang.stack.log

or if there are more relevant pids,

$ head -100  /proc/{<pid_1>,...,<pid_n>}/task/*/stack > glusterhang.stack.log

("head -100" is just a hack to get nice annotated formatting for cheap).

## Harvest and send

Finally, when the dire situation sets in, harvest and send us the resulting logs (glusterhang.strace.log, glusterhang.fdump, glusterhang.stack.log). For all of it to make sense, it's important that they are produced by the same test run. For sending, I suggest you zip or targz up the logfiles so that it is evident in retrospect that they go together. If you happen to send additional per-session data (gluster statedump, /sys/fs/fuse/connections/*/waiting), please extend these guidelines to them as well.
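Put together, one possible capture session could look like this (paths are illustrative, and the glusterfs client is assumed to have been restarted beforehand with its usual arguments plus --dump-fuse=/tmp/glusterhang.fdump):

$ strace -f -s500 -o glusterhang.strace.log ./batch.sh
$ head -100 /proc/<pid>/task/*/stack > glusterhang.stack.log   # once the hang sets in
$ tar czf glusterhang-logs.tar.gz glusterhang.strace.log glusterhang.stack.log /tmp/glusterhang.fdump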

Thanks,
Csaba
Comment 22 Johannes Martin 2012-07-03 05:59:11 EDT
Created attachment 595921 [details]
Collection of debug information

Attachment contains dumps etc. as requested.

One note: the --dump-fuse parameter is silently ignored if it is specified AFTER the mount point.
Comment 23 Csaba Henk 2012-07-03 09:47:58 EDT
Thanks Johannes for the quick reply.

I wasn't aware of the significance of opt/arg order, good that you point it out. (I find it weird though; I thought GNU-style option parsing, which we use AFAIK, scans for options until either the end of argv or "--" is met... does anyone have a clue?)

There is multiple weirdness here. The hang as such is not directly in a VFS call (which is the typical case when no answer comes from the FUSE daemon); rather, it seems the JVM hits a fatal condition and gets deadlocked in its emergency code.

- strace:

9098  open("/server/gluster/nobackup/martinj/v1.9.47/ojdbc14-10.1.0.2.0.jar", O_RDONLY <unfinished ...>
9098  <... open resumed> )              = 3
9098  fstat(3, {st_mode=S_IFREG|0644, st_size=1353107, ...}) = 0
9098  lseek(3, 0, SEEK_END)             = 1353107
9098  mmap(NULL, 1353107, PROT_READ, MAP_SHARED, 3, 0 <unfinished ...>
9098  <... mmap resumed> )              = 0x7ffdd0233000
9098  close(3 <unfinished ...>
9098  <... close resumed> )             = 0
9098  --- SIGBUS (Bus error) @ 0 (0) ---
   ... 9098 writes out some crash report, then finally:
9098  rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
9098  tgkill(9098, 9098, SIGABRT)       = 0
9098  --- SIGABRT (Aborted) @ 0 (0) ---

That is followed by some reports on other threads killed by SIGABRT, like:

9124  +++ killed by SIGABRT +++
9123  +++ killed by SIGABRT +++
9122  +++ killed by SIGABRT +++
...

but _not_ 9098.

@Johannes: do you have any clue why you got a SIGBUS? If glusterfs does not respond, I'd expect a hang in VFS there; yet here we get SIGBUS...
And what about SIGABRT? Is it some internal mechanism, or user-initiated? What happens to the java process? Does it just hang till you abort the FUSE mount? And can you still get the stack traces of the threads after the "killed by SIGABRT" event (as shown by strace -- sounds weird)? Or does SIGABRT come somehow after snapshotting the thread stacks?

- sanitized thread stacks (uninteresting parts stripped, common ones joined):

9098
[<ffffffff810b4395>] sync_page+0x35/0x60
[<ffffffff810b4511>] __lock_page+0x81/0xb0
[<ffffffff810b46d0>] find_lock_page+0x60/0x80
[<ffffffff810b4ee9>] filemap_fault+0x159/0x3d0
[<ffffffff810cd9f7>] __do_fault+0x57/0x520
[<ffffffff810d2349>] handle_mm_fault+0x199/0x430
[<ffffffff813999cf>] do_page_fault+0x1bf/0x3e0
[<ffffffff813972ef>] page_fault+0x1f/0x30
[<00007ffe1d99e7b0>] 0x7ffe1d99e7b0
[<ffffffffffffffff>] 0xffffffffffffffff

9107 9108 9109 9110 9112 9113 9120 9121 9123 9119 9124
[<ffffffff81073a05>] futex_wait_queue_me+0xb5/0x100
...

9111
[<ffffffff8105b9b2>] ptrace_stop+0xa2/0x160
...

9122
[<ffffffff811e5bb3>] call_rwsem_down_write_failed+0x13/0x20
[<ffffffffffffffff>] 0xffffffffffffffff

Apparently 9098 is stuck in some VM stuff, the others are just waiting in user-space synchronization code.

- fuse:

LOOKUP<len: 63 opcode: 1 unique: 220 nodeid: 139959571911736 uid: 1002 gid: 500 pid: 9098 padding: 0> <"ojdbc14-10.1.0.2.0.jar\0">
out_header<len: 144 error: 0 unique: 220> <entry_out<nodeid: 139959571915732 generation: 0 entry_valid: 1 attr_valid: 1 entry_valid_nsec: 0 attr_valid_nsec: 0 attr: attr<ino: 11209039086107610944 size: 1353107 blocks: 2648 atime: 1341299791 mtime: 1325511644 ctime: 1340867944 atimensec: 869711708 mtimensec: 0 ctimensec: 736127709 mode: 33188 nlink: 1 uid: 1002 gid: 500 rdev: 0 blksize: 131072 padding: 0>>>
OPEN<len: 48 opcode: 14 unique: 221 nodeid: 139959571915732 uid: 1002 gid: 500 pid: 9098 padding: 0> <open_in<flags: 32768 unused: 0>>
out_header<len: 32 error: 0 unique: 221> <open_out<fh: 6929632 open_flags: 0 padding: 0>>
FLUSH<len: 64 opcode: 25 unique: 222 nodeid: 139959571915732 uid: 1002 gid: 500 pid: 9098 padding: 0> <flush_in<fh: 6929632 unused: 0 padding: 0 lock_owner: 2655826312817703429>>
out_header<len: 16 error: 0 unique: 222> <>
READ<len: 80 opcode: 15 unique: 223 nodeid: 139959571915732 uid: 1002 gid: 500 pid: 9098 padding: 0> <read_in<fh: 6929632 offset: 1286144 size: 69632 read_flags: 0 lock_owner: 0 flags: 32768 padding: 0>>

So after OPEN, the file gets FLUSH (close) and READ (mmap) immediately. The READ is not answered.

@Avati, Shishir, Du, Pranith: any race condition around this that could prevent the READ from being answered?

- fusectl stuff:

We have _two_ outstanding FUSE requests. I guess one is the READ, and that the other would be the RELEASE part of close, although glusterfs does not get to process that.

@Pranith: this seems to contradict your statement in https://bugzilla.redhat.com/show_bug.cgi?id=835834#c6 -- can it be the case that glusterfs deadlocks at a lower level than where frames would be allocated?

@Johannes: share one more thing please -- all the thread stacks from /proc for the glusterfs client process when it went defunct. I think it's OK if you send only that now, without taking all the logs again.
Comment 24 Csaba Henk 2012-07-03 09:50:50 EDT
[The fusedump excerpt is not really well readable in the way bugzilla shows it, for a fancier presentation go here please: http://fpaste.org/0KYg/]
Comment 25 Csaba Henk 2012-07-03 10:04:27 EDT
[To not get lost in the data flux, a quick summary of whom I asked further questions (in @Name form) in order to be able to make progress with the investigation:

- Johannes: at two places
- Avati, Shishir, Du, Pranith addressed commonly: one place
- Pranith: one place
]
Comment 26 Amar Tumballi 2012-07-03 20:00:54 EDT
Changed the product; the bug was filed while we were handling some flags in Bugzilla to retire a few inactive products.
Comment 27 Pranith Kumar K 2012-07-04 07:27:11 EDT
(In reply to comment #25)
> [To not get lost in the data flux, a quick summary whom I asked further
> questions (in @Name form) in order to be able make a progress with the
> investigation:
> 
> - Johannes: at two places
> - Avati, Shishir, Du, Pranith addressed commonly: one place
> - Pranith: one place
> ]

I am not sure about the common question.

Regarding the pending frames:
I could be wrong. This is the call-frames part of the statedump attached by Johannes:
[global.callpool.stack.1]
uid=1018
gid=500
pid=11572
unique=1359
lk-owner=0000000000000000
type=0
cnt=1

[global.callpool.stack.1.frame.1]
ref_count=0
translator=fuse
complete=0

[global.callpool.stack.2]
uid=0
gid=0
pid=0
unique=0
lk-owner=
type=0
cnt=1

[global.callpool.stack.2.frame.1]
ref_count=0
translator=glusterfs
complete=0

Whenever I looked at statedumps while debugging, I always saw one pending frame in fuse like this, so I assumed both to be the same. Could you please explain why there is always this one extra frame in the statedump?
(This is the one I took from my laptop just now.)
[global.callpool.stack.1]
uid=0
gid=0
pid=0
unique=0
lk-owner=
type=0
cnt=1

[global.callpool.stack.1.frame.1]
ref_count=0
translator=glusterfs
complete=0
Comment 28 Csaba Henk 2012-07-04 14:44:30 EDT
Some FYI/updates.

- I have managed to figure out that the jvm stack

  [<ffffffff810b4395>] sync_page+0x35/0x60
  [<ffffffff810b4511>] __lock_page+0x81/0xb0
  [<ffffffff810b46d0>] find_lock_page+0x60/0x80
  ...

  indicates the usage of the mmap'd file region. So yeah, the unanswered
  READ message comes from the VM subsystem.

- OK, after reading the earlier comments carefully enough, I now understand
  that the SIGBUS happens when the user kills glusterfs, and that it and all that
  comes after is irrelevant.

- Johannes' observation that the issue is seen with JDK 5 (and not later) can
  probably be explained by the following JVM change: see this comment in the JDK 6
  source (at the place where jar extraction takes place):

 http://hg.openjdk.java.net/jdk6/jdk6/jdk/file/99b43838c5d0/src/share/native/java/util/zip/zip_util.c, line 510

  This indicates a change in the I/O patterns from JDK 5 to 6 (only JDK >=
  6 source seems to be available, so we cannot see exactly how it used to be).

- I'd really like to see glusterfs' thread stacks.
Comment 29 Johannes Martin 2012-07-05 04:10:15 EDT
Hi, sorry for taking so long to answer, we had a crazy day here yesterday...

Regarding the first question about SIGBUS/SIGABRT and thread stacks: I guess you have figured out most of it already by now.

What I did was the following:
- start glusterfs mount
- start the program under strace -f
- wait for the lockup
- collect the additional information (thread stacks, ...)
- kill glusterfs

The Java program sits and waits silently until I kill glusterfs. Then it aborts with a JVM internal error message.

SIGBUS and SIGABRT happen only after I kill glusterfs (strace keeps writing to the log file).
Comment 30 Johannes Martin 2012-07-05 04:11:43 EDT
Created attachment 596348 [details]
stack trace of glusterfs client after lockup happens
Comment 31 Csaba Henk 2012-07-06 08:03:21 EDT
Huh, this is strange.

The thread that reads requests from the fuse device should normally be blocked on the respective read syscall.

Indeed, its stack is:

==> /proc/5326/task/5351/stack <==
[<ffffffff811e5b84>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffffffffffff>] 0xffffffffffffffff

It is also weird that we don't see a normal syscall stack, only
the call_rwsem_down_read_failed entry... so I can't even
infer what's being attempted at this point.

It might be due to some kernel bug or system issue.
Do you find the system completely sane in other respects?

Johannes, the next thing I would like to ask of you is to run the glusterfs client itself under strace (with the same parametrization as suggested above) and send:
- the strace log
- lsof output for glusterfs client
- dmesg output

To reduce noise, please collect all these when the hang is in place, but before you'd do any kind of intervention.

It's just guesswork; I don't know whether these will contain anything usable... we will see.
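A rough sketch of the requested capture (the volume, server and mount point names are illustrative; the real arguments are the ones found via ps for the existing mount):

$ strace -f -s500 -o glusterfs-client.strace.log \
      /usr/sbin/glusterfs --volfile-server=server1 --volfile-id=testvol \
      --dump-fuse=/tmp/glusterhang.fdump /server/gluster
# then, with the hang in place and before any other intervention:
$ lsof -p <pid-of-glusterfs-client> > glusterfs-client.lsof.log
$ dmesg > glusterfs-client.dmesg.log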
Comment 32 Csaba Henk 2012-07-06 08:06:07 EDT
... also, please send the corresponding stack traces for the gluster client, to confirm whether it consistently produces this call_rwsem_down_read_failed phenomenon.
Comment 33 Johannes Martin 2012-07-06 09:31:28 EDT
Created attachment 596647 [details]
strace of glusterfs client

Added strace of glusterfs client at the time of the hangup as requested.

lsof got stuck itself and completed only after glusterfs was killed, so the lsof output reflects the situation at an unknown time.

dmesg did not show any new entries (last line is the leap second note from last Sunday).

The server runs fine in all other respects; uptime is almost 6 months now.

It is a dedicated Oracle 11 server, so it might have some patches, extra modules, or sysctl settings from Oracle.

I'll be on vacation for the next four weeks, so I probably will not be able to provide any more feedback till early August.
Comment 34 Csaba Henk 2012-07-06 10:14:00 EDT
OK. The strace log makes it much clearer.

We use the procfs to look up the auxiliary gids of the process
on behalf of which a given FUSE request was sent. We parse this out
from /proc/<pid>/status. What we can observe here is that the
read from /proc/<pid>/status hangs. (Nb. it completed successfully
for several previous requests from the same process.)

It seems to be a kernel bug; I'll look around to see whether such a case
has been seen elsewhere, and whether there is a legitimate or known
reason for this to happen.
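As an illustration (pid placeholder only): the gids in question are the ones on the "Groups:" line of the status file, and one can check by hand whether that read blocks while the hang is in place:

$ grep '^Groups:' /proc/<pid-of-requesting-process>/status
# if this itself blocks during the hang, the stall is in procfs / the kernel
# rather than in glusterfs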
Comment 35 Csaba Henk 2012-07-06 11:07:04 EDT
Johannes,

I did not find anything about a /proc/<pid>/status hang; I think the issue is a kernel issue, not in glusterfs' scope.

A workaround would be to comment out the frame_fill_groups() call in get_call_frame_for_req() of fuse-helpers.c:

https://github.com/gluster/glusterfs/blob/v3.3.0/xlators/mount/fuse/src/fuse-helpers.c#L278

Note that with this we won't have knowledge about aux gids in the fs -- is that a concern for you? (Most likely it won't be; it's actually server-side ACL support that relies on it, cf. https://bugzilla.redhat.com/show_bug.cgi?id=764547.)

So, the plan is as follows (probably to be continued in August):

- Please recompile the glusterfs client with the frame_fill_groups() call commented out; a build sketch follows after this list. (For testing purposes, it's fine if you download the 3.3.0 source, fix it up, compile and install it with --prefix=/opt/somewhere, and use the client from there; you don't need to replace the whole of your existing gluster installation for the sake of this.)

- Confirm if the issue goes away with patched client.

- If no: we'll return to the investigation of the issue.

- If yes: I'll close this bug, and I suggest that you
  - raise the issue on LKML / your distro's bugtracker / your OS vendor's
    support line (if possible, we would be grateful for a reference here
    to that discussion or issue tracker entry)
  - as a workaround, deploy and use glusterfs with frame_fill_groups()
    commented out
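For reference, a rough sketch of the test build (paths and tarball name are only examples, assuming the plain 3.3.0 release source):

$ tar xf glusterfs-3.3.0.tar.gz && cd glusterfs-3.3.0
# comment out the frame_fill_groups() call in
# xlators/mount/fuse/src/fuse-helpers.c, get_call_frame_for_req()
$ ./configure --prefix=/opt/somewhere
$ make && make install
# then mount using the glusterfs client binary from /opt/somewhere
# instead of the system one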

Is this OK with you?
Comment 36 Csaba Henk 2012-07-06 13:48:39 EDT
(Accidentally closed the bug, now reopening.)
Comment 37 Johannes Martin 2012-08-30 08:58:40 EDT
Sorry for taking so long to get back to you.

I patched the client today, and on the second attempt (the first time I failed to do a make clean) the patched client worked without failure, as you predicted.

I ran the program continuously for a couple of minutes using a while loop from within a shell to make sure it wasn't just one of those occasions when the test succeeded, but there wasn't a single failure.
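The loop was nothing fancy, something along these lines (sketch; batch.sh is the script from the attached reproducer):

$ while ./batch.sh; do :; done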

I will try to raise a problem report with SUSE now and will post a link to it here when I get one.
Comment 38 Johannes Martin 2012-08-30 09:01:11 EDT
P.S.: Doing the custom compile wasn't actually that easy, since one of the components insisted on the prefix ending in 'usr', so I ended up using prefix=/tmp/glusterfs/usr. I also had to set LD_PRELOAD before starting glusterfs:

export LD_LIBRARY_PATH=/tmp/glusterfs/usr/lib
export LD_PRELOAD=/tmp/glusterfs/usr/lib/libglusterfs.so.0:/tmp/glusterfs/usr/lib/libgfrpc.so.0:/tmp/glusterfs/usr/lib/libgfxdr.so.0
Comment 39 Amar Tumballi 2012-09-18 06:07:43 EDT
Created attachment 613944 [details]
option to disable the auxiliary gid support in glusterfs

Not completely convinced about the patch yet, hence it is not going to the review system for now.
Comment 40 Amar Tumballi 2012-09-18 06:17:08 EDT
Posted the patch at http://review.gluster.org/3954; please share your comments there.
Comment 41 Amar Tumballi 2012-09-20 00:29:37 EDT
Patch submitted upstream. Next release should have the fix for you.
