Bug 1027364 - Quota: Copying of data hangs with a kernel backtrace in /var/log/messages.
Summary: Quota: Copying of data hangs with a kernel backtrace in /var/log/messages.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterfs
Version: 2.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Raghavendra G
QA Contact: Gowrishankar Rajaiyan
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-11-06 16:30 UTC by Gowrishankar Rajaiyan
Modified: 2013-11-27 15:46 UTC (History)
CC: 8 users

Fixed In Version: glusterfs-3.4.0.44rhs-1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-11-27 15:46:23 UTC
Embargoed:


Attachments
statedump of the GlusterFS native mount process (186.79 KB, text/plain) - 2013-11-07 12:02 UTC, Gowrishankar Rajaiyan
gluster volume statedump shanks-quota all (14.58 KB, text/plain) - 2013-11-08 08:48 UTC, Gowrishankar Rajaiyan
kill -USR1 <pidof quotad> (7.83 KB, text/plain) - 2013-11-08 08:51 UTC, Gowrishankar Rajaiyan
gluster volume statedump shanks-vol all (7.83 KB, text/plain) - 2013-11-11 09:32 UTC, Gowrishankar Rajaiyan
rhs-new-files-brick3.29861.dump.1384162483 (10.50 KB, text/plain) - 2013-11-11 09:41 UTC, Gowrishankar Rajaiyan


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2013:1769 0 normal SHIPPED_LIVE Red Hat Storage 2.1 enhancement and bug fix update #1 2013-11-27 20:17:39 UTC

Description Gowrishankar Rajaiyan 2013-11-06 16:30:17 UTC
Description of problem: Copying hangs on a FUSE-mounted directory containing ~25G of data, and a kernel backtrace is logged in /var/log/messages.


Version-Release number of selected component (if applicable):
RHEL 6.5 client: kernel-2.6.32-430.el6.x86_64 & glusterfs-fuse-3.4.0.39rhs-1.el6_4.x86_64

How reproducible: Consistently.


Steps to Reproduce:
1. Fuse mount client.
2. Copy data of around 25G using Nautilus. (copying hangs indefinitely after ~3.4G)
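
(For reference, step 1 is a standard GlusterFS native mount; the server hostname and mount point below are placeholders, and the volume name is taken from later comments:)

[root@client ~]# mount -t glusterfs <server>:/shanks-vol /mnt/glusterfs
[root@client ~]# mount | grep glusterfs     # confirm the fuse mount is active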


Actual results: Copying hangs with the following messages in /var/log/messages

Nov  6 16:25:27 localhost kernel: INFO: task nautilus:2520 blocked for more than 120 seconds.
Nov  6 16:25:27 localhost kernel:      Not tainted 2.6.32-430.el6.x86_64 #1
Nov  6 16:25:27 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  6 16:25:27 localhost kernel: nautilus      D 0000000000000000     0  2520   2074 0x00000080
Nov  6 16:25:27 localhost kernel: ffff88020e7f5e38 0000000000000082 ffff88020e7f5dd8 ffffffff81054839
Nov  6 16:25:27 localhost kernel: ffff8802124296e8 0000000300000001 00000000000000d0 ffff880215700840
Nov  6 16:25:27 localhost kernel: ffff880214ce1af8 ffff88020e7f5fd8 000000000000fbc8 ffff880214ce1af8
Nov  6 16:25:27 localhost kernel: Call Trace:
Nov  6 16:25:27 localhost kernel: [<ffffffff81054839>] ? __wake_up_common+0x59/0x90
Nov  6 16:25:27 localhost kernel: [<ffffffff8109b5ce>] ? prepare_to_wait+0x4e/0x80
Nov  6 16:25:27 localhost kernel: [<ffffffffa0186085>] fuse_request_send+0xe5/0x290 [fuse]
Nov  6 16:25:27 localhost kernel: [<ffffffff8109b2a0>] ? autoremove_wake_function+0x0/0x40
Nov  6 16:25:27 localhost kernel: [<ffffffffa018bff6>] fuse_flush+0x106/0x140 [fuse]
Nov  6 16:25:27 localhost kernel: [<ffffffff81185b6c>] filp_close+0x3c/0x90
Nov  6 16:25:27 localhost kernel: [<ffffffff81185c65>] sys_close+0xa5/0x100
Nov  6 16:25:27 localhost kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Nov  6 16:25:27 localhost kernel: INFO: task gvfsd-metadata:2515 blocked for more than 120 seconds.
Nov  6 16:25:27 localhost kernel:      Not tainted 2.6.32-430.el6.x86_64 #1
Nov  6 16:25:27 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  6 16:25:27 localhost kernel: gvfsd-metadat D 0000000000000001     0  2515      1 0x00000080
Nov  6 16:25:27 localhost kernel: ffff8800378c5e38 0000000000000082 ffff8800378c5dd8 ffffffff81054839
Nov  6 16:25:27 localhost kernel: ffff880213a02890 0000000300000001 00000000000000d0 ffff880215700840
Nov  6 16:25:27 localhost kernel: ffff8802152ea638 ffff8800378c5fd8 000000000000fbc8 ffff8802152ea638
Nov  6 16:25:27 localhost kernel: Call Trace:
Nov  6 16:25:27 localhost kernel: [<ffffffff81054839>] ? __wake_up_common+0x59/0x90
Nov  6 16:25:27 localhost kernel: [<ffffffff8109b5ce>] ? prepare_to_wait+0x4e/0x80
Nov  6 16:25:27 localhost kernel: [<ffffffffa0186085>] fuse_request_send+0xe5/0x290 [fuse]
Nov  6 16:25:27 localhost kernel: [<ffffffff8109b2a0>] ? autoremove_wake_function+0x0/0x40
Nov  6 16:25:27 localhost kernel: [<ffffffffa018bff6>] fuse_flush+0x106/0x140 [fuse]
Nov  6 16:25:27 localhost kernel: [<ffffffff81185b6c>] filp_close+0x3c/0x90
Nov  6 16:25:27 localhost kernel: [<ffffffff81185c65>] sys_close+0xa5/0x100
Nov  6 16:25:27 localhost kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Nov  6 16:27:27 localhost kernel: INFO: task nautilus:2520 blocked for more than 120 seconds.
Nov  6 16:27:27 localhost kernel:      Not tainted 2.6.32-430.el6.x86_64 #1
Nov  6 16:27:27 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  6 16:27:27 localhost kernel: nautilus      D 0000000000000000     0  2520   2074 0x00000080
Nov  6 16:27:27 localhost kernel: ffff88020e7f5e38 0000000000000082 ffff88020e7f5dd8 ffffffff81054839
Nov  6 16:27:27 localhost kernel: ffff8802124296e8 0000000300000001 00000000000000d0 ffff880215700840
Nov  6 16:27:27 localhost kernel: ffff880214ce1af8 ffff88020e7f5fd8 000000000000fbc8 ffff880214ce1af8
Nov  6 16:27:27 localhost kernel: Call Trace:
Nov  6 16:27:27 localhost kernel: [<ffffffff81054839>] ? __wake_up_common+0x59/0x90
Nov  6 16:27:27 localhost kernel: [<ffffffff8109b5ce>] ? prepare_to_wait+0x4e/0x80
Nov  6 16:27:27 localhost kernel: [<ffffffffa0186085>] fuse_request_send+0xe5/0x290 [fuse]
Nov  6 16:27:27 localhost kernel: [<ffffffff8109b2a0>] ? autoremove_wake_function+0x0/0x40
Nov  6 16:27:27 localhost kernel: [<ffffffffa018bff6>] fuse_flush+0x106/0x140 [fuse]
Nov  6 16:27:27 localhost kernel: [<ffffffff81185b6c>] filp_close+0x3c/0x90
Nov  6 16:27:27 localhost kernel: [<ffffffff81185c65>] sys_close+0xa5/0x100
Nov  6 16:27:27 localhost kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Nov  6 16:27:27 localhost kernel: INFO: task gvfsd-metadata:2515 blocked for more than 120 seconds.
Nov  6 16:27:27 localhost kernel:      Not tainted 2.6.32-430.el6.x86_64 #1
Nov  6 16:27:27 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  6 16:27:27 localhost kernel: gvfsd-metadat D 0000000000000001     0  2515      1 0x00000080
Nov  6 16:27:27 localhost kernel: ffff8800378c5e38 0000000000000082 ffff8800378c5dd8 ffffffff81054839
Nov  6 16:27:27 localhost kernel: ffff880213a02890 0000000300000001 00000000000000d0 ffff880215700840
Nov  6 16:27:27 localhost kernel: ffff8802152ea638 ffff8800378c5fd8 000000000000fbc8 ffff8802152ea638
Nov  6 16:27:27 localhost kernel: Call Trace:
Nov  6 16:27:27 localhost kernel: [<ffffffff81054839>] ? __wake_up_common+0x59/0x90
Nov  6 16:27:27 localhost kernel: [<ffffffff8109b5ce>] ? prepare_to_wait+0x4e/0x80
Nov  6 16:27:27 localhost kernel: [<ffffffffa0186085>] fuse_request_send+0xe5/0x290 [fuse]
Nov  6 16:27:27 localhost kernel: [<ffffffff8109b2a0>] ? autoremove_wake_function+0x0/0x40
Nov  6 16:27:27 localhost kernel: [<ffffffffa018bff6>] fuse_flush+0x106/0x140 [fuse]
Nov  6 16:27:27 localhost kernel: [<ffffffff81185b6c>] filp_close+0x3c/0x90
Nov  6 16:27:27 localhost kernel: [<ffffffff81185c65>] sys_close+0xa5/0x100
Nov  6 16:27:27 localhost kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

Comment 1 Ben Turner 2013-11-06 16:40:19 UTC
I am also seeing this during FS sanity bonnie++ runs. Bonnie was taking a terribly long time to complete, and when I logged in to check on things I saw:

INFO: task df:3147 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
df            D 0000000000000002     0  3147   2901 0x00000084
 ffff8800961abce8 0000000000000086 ffff8800961abc88 ffffffff81051439
 ffff8800961abc78 0000000300000000 ffff8800961abc88 ffff880152e0d040
 ffff880004dfe638 ffff8800961abfd8 000000000000fb88 ffff880004dfe638
Call Trace:
 [<ffffffff81051439>] ? __wake_up_common+0x59/0x90
 [<ffffffff8109708e>] ? prepare_to_wait+0x4e/0x80
 [<ffffffffa02c9075>] fuse_request_send+0xe5/0x290 [fuse]
 [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa02d1c0a>] fuse_statfs+0xea/0x160 [fuse]
 [<ffffffff811b3ec4>] statfs_by_dentry+0x74/0xa0
 [<ffffffff811b3ffb>] vfs_statfs+0x1b/0xb0
 [<ffffffff811b42c0>] do_statfs_native+0x20/0xb0
 [<ffffffff811b4452>] sys_statfs+0x72/0xb0
 [<ffffffff811813e2>] ? vfs_write+0x132/0x1a0
 [<ffffffff810dc937>] ? audit_syscall_entry+0x1d7/0x200
 [<ffffffff810dc685>] ? __audit_syscall_exit+0x265/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

Comment 2 krishnan parthasarathi 2013-11-07 04:24:47 UTC
Shanks/Ben Turner,

Could you provide a statedump of the GlusterFS native mount process that is hung? It would be good to have the log files as well.

Would it be possible to check whether the issue occurs on other RHEL versions, say 6.4?

Ben,
Did you see this behaviour in the same RHEL/kernel/fuse versions as Shanks?
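
(For reference, the statedump of the native mount process requested above can be generated by sending SIGUSR1 to the glusterfs client process on the mount host; the pid lookup below is illustrative:)

[root@client ~]# kill -USR1 $(pidof glusterfs)     # writes a statedump of the fuse mount process

The dump file lands in the glusterfs statedump directory, typically /var/run/gluster.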

Comment 3 Gowrishankar Rajaiyan 2013-11-07 12:02:57 UTC
Created attachment 821038 [details]
statedump of the GlusterFS native mount process

Comment 4 Raghavendra G 2013-11-08 06:03:39 UTC
Shanks,

Can you also provide statedumps of the brick and quotad processes? This looks like an issue in the code path where the brick queries quotad for the aggregated size of the directory while enforcing quota limits.

regards,
Raghavendra.
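
(For reference, the server-side dumps can be collected roughly as follows; the volume name matches this bug and the quotad pid lookup is illustrative:)

[root@server1 ~]# gluster volume statedump shanks-vol all     # statedumps of all brick processes for the volume
[root@server1 ~]# kill -USR1 <pidof quotad>                   # statedump of the quotad process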

Comment 5 Gowrishankar Rajaiyan 2013-11-08 06:51:12 UTC
I have seen this on volumes even without quota enabled.

Comment 6 Gowrishankar Rajaiyan 2013-11-08 08:48:48 UTC
Created attachment 821450 [details]
gluster volume statedump shanks-quota all

Comment 7 Gowrishankar Rajaiyan 2013-11-08 08:51:44 UTC
Created attachment 821451 [details]
kill -USR1 <pidof quotad>

Comment 8 Gowrishankar Rajaiyan 2013-11-11 09:32:12 UTC
Created attachment 822324 [details]
gluster volume statedump shanks-vol all

Comment 9 Gowrishankar Rajaiyan 2013-11-11 09:41:09 UTC
Created attachment 822329 [details]
rhs-new-files-brick3.29861.dump.1384162483

Comment 10 Gowrishankar Rajaiyan 2013-11-11 11:09:00 UTC
RHEL 6.4 client:

Nov 11 11:06:52 localhost kernel: INFO: task nautilus:2674 blocked for more than 120 seconds.
Nov 11 11:06:52 localhost kernel:      Not tainted 2.6.32-430.el6.x86_64 #1
Nov 11 11:06:52 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 11 11:06:52 localhost kernel: nautilus      D 0000000000000001     0  2674  16615 0x00000080
Nov 11 11:06:52 localhost kernel: ffff880212579e38 0000000000000086 0000000000000000 ffffffff81054839
Nov 11 11:06:52 localhost kernel: ffff88020daad728 0000000300000001 00000000000000d0 ffff88021560ac40
Nov 11 11:06:52 localhost kernel: ffff88020db5e638 ffff880212579fd8 000000000000fbc8 ffff88020db5e638
Nov 11 11:06:52 localhost kernel: Call Trace:
Nov 11 11:06:52 localhost kernel: [<ffffffff81054839>] ? __wake_up_common+0x59/0x90
Nov 11 11:06:52 localhost kernel: [<ffffffffa00e0085>] fuse_request_send+0xe5/0x290 [fuse]
Nov 11 11:06:52 localhost kernel: [<ffffffff8109b2a0>] ? autoremove_wake_function+0x0/0x40
Nov 11 11:06:52 localhost kernel: [<ffffffffa00e5ff6>] fuse_flush+0x106/0x140 [fuse]
Nov 11 11:06:52 localhost kernel: [<ffffffff81185b6c>] filp_close+0x3c/0x90
Nov 11 11:06:52 localhost kernel: [<ffffffff81185c65>] sys_close+0xa5/0x100
Nov 11 11:06:52 localhost kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b


Version: glusterfs-3.4.0.42rhs-1.el6_4.x86_64

Comment 11 Gowrishankar Rajaiyan 2013-11-11 11:09:53 UTC
I could not reproduce this on a volume where quota is not enabled, hence updating the bug summary to reflect this.

Comment 12 Kaushal 2013-11-11 14:00:43 UTC
I cannot reproduce this with the reproducer steps that were given in the mail (steps reproduced below). I've tried to follow the steps as closely as possible.

-------- The steps given ---------
1. Create distributed replicate volume.
2. Enable quota and set deem-statfs to on and set limit.
[root@server1 ~]# gluster vol quota shanks-vol list
                  Path                   Hard-limit Soft-limit   Used  Available
--------------------------------------------------------------------------------
/                                        200.0GB       80%     137.6GB  62.4GB
/shanks/Downloads                        175.0GB       80%     137.6GB  37.4GB

3. Mount shanks-vol on /home in the RHEL 6.4 client.
4. In the RHEL 6.4 client, do "useradd shanks", "passwd shanks".
5. Launch the GUI and navigate to the "Downloads" directory using nautilus.
6. Download ISOs of 2.5G and over from http://rhsqe-repo.lab.eng.blr.redhat.com/iso/windows/
7. Populate the "Downloads" directory with ~30G of data. (I downloaded the same file twice/thrice.)
8. Using nautilus, CTRL+a -> CTRL+c -> CTRL+v on the same "Downloads" directory.
9. You should see the hang when the copying is at around ~5GB, followed by backtraces in /var/log/messages.
--------- end of the steps -----------

-------- My test -----------
I used glusterfs-3.4.0.40rhs-2.el6rhs rpms.
Created a 2x2 dist-rep volume over 4 nodes, enabled quota on it, set deem-statfs and set the limits.
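
(For reference, that setup corresponds roughly to the following commands; the volume name and limits are the ones shown in the listings below, the rest is standard quota CLI syntax:)

[root@minion2 ~]# gluster volume quota test enable
[root@minion2 ~]# gluster volume set test features.quota-deem-statfs on
[root@minion2 ~]# gluster volume quota test limit-usage / 100GB
[root@minion2 ~]# gluster volume quota test limit-usage /Downloads 75GB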

[root@minion2 ~]# gluster volume info test
 
Volume Name: test
Type: Distributed-Replicate
Volume ID: b85a03e4-d8dd-4704-99d4-3db251535390
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: minion2.usersys.redhat.com:/brick/test
Brick2: minion3.usersys.redhat.com:/brick/test
Brick3: minion4.usersys.redhat.com:/brick/test
Brick4: minion5.usersys.redhat.com:/brick/test
Options Reconfigured:
features.quota-deem-statfs: on
features.quota: on

[root@minion2 ~]# gluster volume quota test list
                  Path                   Hard-limit Soft-limit   Used  Available
--------------------------------------------------------------------------------
/                                        100.0GB       80%      47.4GB  52.6GB
/Downloads                                75.0GB       80%      47.4GB  27.6GB

To test the exact steps given, I created a RHEL 6.4 workstation VM with 4GB of RAM and 2 cores. I installed the RHEL 6.4 client rpms of version glusterfs-3.4.0.40rhs-1.el6_4 on it. I mounted the volume on /mnt and copied some ISOs onto it to create an initial dataset of ~25GB.
Then I launched nautilus, navigated to /mnt/Downloads, and did Ctrl-a,c,v to create a duplicate copy of the data. All the files got duplicated successfully. I then tried duplicating the ~50GB of data, which also succeeded until the quota limit was hit. I didn't get hangs at 5GB as described in the steps.

I didn't do the additional useradd steps or the /home mount given in the steps. Those could be what triggers the issue, but I doubt it.

Comment 15 Gowrishankar Rajaiyan 2013-11-15 06:21:59 UTC
Don't see the hang or backtraces any more. Good catch! :)

Verified: glusterfs-server-3.4.0.44rhs-1.el6rhs.x86_64

Comment 16 errata-xmlrpc 2013-11-27 15:46:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1769.html

