Bug 764814 (GLUSTER-3082) - Strange stack trace possibly caused by multiple servers accessing the same file on a gluster volume.
Summary: Strange stack trace possibly caused by multiple servers accessing the same file on a gluster volume.
Keywords:
Status: CLOSED NOTABUG
Alias: GLUSTER-3082
Product: GlusterFS
Classification: Community
Component: unclassified
Version: 3.2.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Raghavendra G
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-06-23 16:49 UTC by James Morelli
Modified: 2011-09-22 05:19 UTC
CC List: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Regression: ---
Mount Type: fuse
Documentation: ---
CRM:
Verified Versions:


Attachments (Terms of Use)
MiMedia internal bug report (74.85 KB, application/pdf)
2011-06-23 13:49 UTC, James Morelli
dmesg (52.89 KB, text/plain)
2011-06-24 11:15 UTC, James Morelli

Description James Morelli 2011-06-23 16:49:19 UTC
This bug was first noticed on June 6th 2011

Gluster version: 3.2.0 (installed via rpm)

Distribution: CentOS 5.4

Linux Kernel: 2.6.38.7 

Arch: x86_64

Doesn't occur on: Gluster version 3.1.3

While running glusterfs in our development environment we noticed a strange trace occurring on our servers. We downgraded several development servers to gluster 3.1.3 and did not see any traces for a week. Upon upgrading again to version 3.2.0 the stack trace reoccurred, leading us to believe this is a bug in gluster 3.2.0.

The stack trace in question is below: 

INFO: task mm-sg-account-m:24151 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mm-sg-account-m D ffff88013a64c100     0 24151  24124 0x00000000
 ffff88013a64c100 0000000000000086 00000000000117c0 ffff88013a76b268
 ffff88013a76aee0 ffff88013a76af18 ffff8800bfd917c0 ffffffff8103e933
 ffff88013a76aee0 ffff88013a76aee0 ffff8800bfd917c0 ffffffff8100190c
Call Trace:
 [<ffffffff8103e933>] ? dequeue_task_fair+0xe3/0x170
 [<ffffffff8100190c>] ? __switch_to+0x1ec/0x2c0
 [<ffffffff812facd9>] ? schedule+0x379/0xa20
 [<ffffffff810af4e0>] ? sync_page+0x0/0x50
 [<ffffffff812fb4b6>] ? io_schedule+0x56/0x90
 [<ffffffff810af51b>] ? sync_page+0x3b/0x50
 [<ffffffff812fb80a>] ? __wait_on_bit_lock+0x4a/0x90
 [<ffffffff810af4bf>] ? __lock_page+0x5f/0x70
 [<ffffffff81062460>] ? wake_bit_function+0x0/0x30
 [<ffffffff810b97e7>] ? pagevec_lookup+0x17/0x20
 [<ffffffff810b9b35>] ? invalidate_inode_pages2_range+0x295/0x2c0
 [<ffffffff8105477e>] ? recalc_sigpending+0xe/0x30
 [<ffffffffa0328f1b>] ? fuse_request_send+0x21b/0x280 [fuse]
 [<ffffffffa03270c6>] ? fuse_request_init+0x36/0x40 [fuse]
 [<ffffffffa032e161>] ? fuse_do_open+0x121/0x180 [fuse]
 [<ffffffffa032df23>] ? fuse_finish_open+0xe3/0xf0 [fuse]
 [<ffffffffa032e253>] ? fuse_open_common+0x93/0xa0 [fuse]
 [<ffffffffa032e260>] ? fuse_open+0x0/0x10 [fuse]
 [<ffffffff810efe01>] ? __dentry_open+0xf1/0x2f0
 [<ffffffff810fc127>] ? inode_permission+0x47/0xc0
 [<ffffffff810fc85e>] ? finish_open+0xce/0x180
 [<ffffffff810feb7d>] ? do_filp_open+0x25d/0x700
 [<ffffffff81109d3f>] ? alloc_fd+0x3f/0x130
 [<ffffffff810efc25>] ? do_sys_open+0x65/0x110
 [<ffffffff81002bbb>] ? system_call_fastpath+0x16/0x1b
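
If it helps, we can dump similar traces for all blocked tasks on demand via sysrq (assuming the magic sysrq key is enabled on the box):

# echo 1 > /proc/sys/kernel/sysrq (enable sysrq if it is not already)
# echo w > /proc/sysrq-trigger (dump the stacks of all uninterruptible/blocked tasks to dmesg)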

We believe this to be caused by the same file being accessed simultaneously by multiple servers.

Our internal bug report is attached to this issue and additional information can be provided.

Comment 1 Raghavendra G 2011-06-24 02:06:34 UTC
Hi James,

I need a couple of clarifications on your bug-report:

* From your bug-report it seems like bonnie++ hangs when run simultaneously from two clients (running on two different machines), with four instances of bonnie++ in total (two on each client). Am I correct in my understanding? Does bonnie++ hang forever, or does it come out of the hang after some time? Does the ps output show the bonnie++ processes in 'D' state?

* If there is a hang, can you send the gluster state-dump of both gluster clients at the time of the hang, taken with the following command?

# kill -SIGUSR1 <glusterfs-client-pid> (on each machine)

The state-dump can be found in /tmp/glusterdump.<glusterfs-client-pid>
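
For example, something along these lines on each client (a rough sketch assuming a single glusterfs mount process per machine; check ps if there are several):

# CLIENT_PID=$(pidof glusterfs)
# kill -SIGUSR1 "$CLIENT_PID"
# ls -l /tmp/glusterdump."$CLIENT_PID" (the dump file is named after the pid)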

* Can you also send us the dmesg of both servers (it is attached in your internal bug report, but we don't have it)?

regards,
Raghavendra.

Comment 2 James Morelli 2011-06-24 11:15:42 UTC
Created attachment 531
dmesg

Comment 3 James Morelli 2011-06-24 11:26:28 UTC
Hi Raghavendra,

I have attached the dmesg of the server with the stack trace to this issue. Our other server was not seeing the stack trace at the time, so I did not save a copy of its dmesg log (I saw nothing unusual in it).

Bonnie++ was not the process that hung; I am sorry for the confusion there, I should have been more specific. We ran bonnie++ in an attempt to recreate the hung process and the stack trace, but could not manage it that way, as the bonnie++ instances were not accessing the same file, just putting stress on the volume (we originally thought stress was the issue).

However, we run a process called "mm-sg-account-meta". This process runs on two servers simultaneously and accesses the same gluster filesystem. While it runs, the process on both servers simultaneously accesses a file on the gluster filesystem called "process map". When this happens we eventually see this stack trace.
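
To give a rough picture, the concurrent access is something like the hypothetical loop below running on each server at once (the mount point and file name are placeholders, and the loop only approximates the kind of concurrent open/read/write activity involved):

# while true; do dd if=/dev/zero of=/mnt/gluster/process_map bs=4k count=256 conv=notrunc 2>/dev/null; cat /mnt/gluster/process_map > /dev/null; done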

When we see the stack trace occurring, if we check the status of mm-sg-account-meta with ps, it is in the 'D' state.
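
For reference, we spot the stuck processes with something along these lines (any ps invocation that shows the STAT column would do):

# ps -eo pid,stat,comm | awk '$2 ~ /^D/' (lists tasks stuck in uninterruptible sleep)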

I'd be glad to provide you with a state-dump; however, it may be a while before I can do so. We have downgraded back to gluster version 3.1.3 to ensure stability before our company does a production release.

Comment 4 Raghavendra G 2011-06-27 01:35:12 UTC
Hi James,

Please find my inlined comments:
(In reply to comment #3)
> Hi Raghavendra,
> 
> I have attached the dmesg of the server with the stack trace to this issue. Our
> other server at the time was not seeing the stack trace so I did not save a
> copy of it's dmesg log. (I saw nothing unusual in it)
> 
> Bonnie++ was not the issue hanging. I am sorry for the confusion there, I
> should have been more specific. We ran bonnie++ in an attempt to recreate the
> hung process and stack trace but could not accomplish this through the use of
> bonnie++ as they were not accessing the same file, just putting stress on the
> volume (we originally thought stress was the issue)
> 
> However we run a process called "mm-sg-account-meta". This process will run on
> two servers simultaneously and access the same gluster filesystem. While this
> happens the process on both servers will simultaneously access a file on the
> gluster filesystem called "process map". When this happens we will eventually
> see this stack trace.

Is there a test-case we can use to reproduce the issue at our end? It seems that the process you are running is your own application, so it would be great if you could produce a small test-case that reproduces the issue. Also, what kind of system calls does mm-sg-account-meta execute - writes, reads, stat, etc.? I would like to understand the pattern in which the process accesses the file.
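
If a stand-alone test-case is difficult, even an strace capture of the process while it touches that file would help, along the lines of this hypothetical invocation (adjust the pid and output path):

# strace -f -tt -T -e trace=open,read,write,stat,fstat,fcntl,flock -p <mm-sg-account-meta-pid> -o /tmp/mm-sg-account-meta.strace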

> 
> When we see the stack trace occurring, if we check the status of
> mm-sg-account-meta with ps it will be in the 'D' state.
> 
> I'd be glad to provide you with a state-dump however it may be a while before I
> can do so. We have downgraded back to gluster version 3.1.3 to ensure stability
> before our company does a production release.

Can you send us complete client and server side glusterfs logs?
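
With the default rpm install the logs should be under /var/log/glusterfs (assuming you have not changed the log-file location):

# ls /var/log/glusterfs/*.log (client/mount logs, named after the mount point)
# ls /var/log/glusterfs/bricks/*.log (per-brick logs on the servers)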

regards,
Raghavendra.

Comment 5 James Morelli 2011-06-29 12:16:41 UTC
Hey Raghavendra,

We do not have any of our glusterfs logs to provide, as our logging is done on a ramdisk and we did not save the logs to disk prior to rebooting.

However, we will be able to reproduce this issue for you again and provide you with these logs. Right now we are at the end of a release cycle and are unable to run gluster 3.2.0, as we are readying for a deploy. After this cycle we can move back to gluster 3.2.0, and when we do we believe there is a strong chance this issue will reappear.
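
Next time we will copy the logs off the ramdisk before rebooting, roughly like this (the destination path is a placeholder and the log path assumes the default location):

# cp -a /var/log/glusterfs /some/persistent/path/glusterfs-logs-$(date +%F)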

We will provide you with more information at that time regarding how the process accesses the file as well.

Sorry for the inconvenience. 

James Morelli

Comment 6 James Morelli 2011-07-11 15:24:46 UTC
Under heavy load we recently saw similar kernel crashing with gluster version 3.1.3.

Stability returned when we downgraded our kernel from 2.6.38 back to 2.6.32. Because of this, we now suspect this may be an issue with FUSE in kernel 2.6.38. We will continue to test and report any updates to this issue; it may, however, take several weeks for us to find new information.
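
When we report back we will include the exact kernel and fuse details, roughly:

# uname -r (running kernel)
# modinfo fuse (in-tree fuse module details, including vermagic)
# rpm -qa | grep glusterfs (installed gluster packages)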

Thank you

Comment 7 Amar Tumballi 2011-09-13 02:14:34 UTC
> 
> Stability was returned when we downgraded our kernel from 2.6.38 back to
> 2.6.32. Because of this we now estimate this issue may be an issue with FUSE in
> 2.6.38. We will continue to test and report any updates to this issue. It may
> however take several weeks for us to find new information.
> 

Hi James,

Do you have any further comments on this bug? Is it OK to resolve it, considering it may be a FUSE issue in the kernel?

Regards,
Amar

