Description of problem:
Kernel backtrace detected and copying hangs on a fuse-mounted directory with ~25G of data.

Version-Release number of selected component (if applicable):
RHEL 6.5 client: kernel-2.6.32-430.el6.x86_64 & glusterfs-fuse-3.4.0.39rhs-1.el6_4.x86_64

How reproducible:
Consistently.

Steps to Reproduce:
1. Fuse mount client.
2. Copy data of around 25G using Nautilus. (Copying hangs indefinitely after ~3.4G.)

Actual results:
Copying hangs, with the following messages in /var/log/messages:

Nov 6 16:25:27 localhost kernel: INFO: task nautilus:2520 blocked for more than 120 seconds.
Nov 6 16:25:27 localhost kernel: Not tainted 2.6.32-430.el6.x86_64 #1
Nov 6 16:25:27 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 6 16:25:27 localhost kernel: nautilus D 0000000000000000 0 2520 2074 0x00000080
Nov 6 16:25:27 localhost kernel: ffff88020e7f5e38 0000000000000082 ffff88020e7f5dd8 ffffffff81054839
Nov 6 16:25:27 localhost kernel: ffff8802124296e8 0000000300000001 00000000000000d0 ffff880215700840
Nov 6 16:25:27 localhost kernel: ffff880214ce1af8 ffff88020e7f5fd8 000000000000fbc8 ffff880214ce1af8
Nov 6 16:25:27 localhost kernel: Call Trace:
Nov 6 16:25:27 localhost kernel: [<ffffffff81054839>] ? __wake_up_common+0x59/0x90
Nov 6 16:25:27 localhost kernel: [<ffffffff8109b5ce>] ? prepare_to_wait+0x4e/0x80
Nov 6 16:25:27 localhost kernel: [<ffffffffa0186085>] fuse_request_send+0xe5/0x290 [fuse]
Nov 6 16:25:27 localhost kernel: [<ffffffff8109b2a0>] ? autoremove_wake_function+0x0/0x40
Nov 6 16:25:27 localhost kernel: [<ffffffffa018bff6>] fuse_flush+0x106/0x140 [fuse]
Nov 6 16:25:27 localhost kernel: [<ffffffff81185b6c>] filp_close+0x3c/0x90
Nov 6 16:25:27 localhost kernel: [<ffffffff81185c65>] sys_close+0xa5/0x100
Nov 6 16:25:27 localhost kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Nov 6 16:25:27 localhost kernel: INFO: task gvfsd-metadata:2515 blocked for more than 120 seconds.
Nov 6 16:25:27 localhost kernel: Not tainted 2.6.32-430.el6.x86_64 #1
Nov 6 16:25:27 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 6 16:25:27 localhost kernel: gvfsd-metadat D 0000000000000001 0 2515 1 0x00000080
Nov 6 16:25:27 localhost kernel: ffff8800378c5e38 0000000000000082 ffff8800378c5dd8 ffffffff81054839
Nov 6 16:25:27 localhost kernel: ffff880213a02890 0000000300000001 00000000000000d0 ffff880215700840
Nov 6 16:25:27 localhost kernel: ffff8802152ea638 ffff8800378c5fd8 000000000000fbc8 ffff8802152ea638
Nov 6 16:25:27 localhost kernel: Call Trace:
Nov 6 16:25:27 localhost kernel: [<ffffffff81054839>] ? __wake_up_common+0x59/0x90
Nov 6 16:25:27 localhost kernel: [<ffffffff8109b5ce>] ? prepare_to_wait+0x4e/0x80
Nov 6 16:25:27 localhost kernel: [<ffffffffa0186085>] fuse_request_send+0xe5/0x290 [fuse]
Nov 6 16:25:27 localhost kernel: [<ffffffff8109b2a0>] ? autoremove_wake_function+0x0/0x40
Nov 6 16:25:27 localhost kernel: [<ffffffffa018bff6>] fuse_flush+0x106/0x140 [fuse]
Nov 6 16:25:27 localhost kernel: [<ffffffff81185b6c>] filp_close+0x3c/0x90
Nov 6 16:25:27 localhost kernel: [<ffffffff81185c65>] sys_close+0xa5/0x100
Nov 6 16:25:27 localhost kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

(The same two backtraces, for nautilus:2520 and gvfsd-metadata:2515, repeat at 16:27:27.)
I am also seeing this during FS sanity bonnie++ runs as well. Bonnie was taking a terribly long time to complete, and when I logged in to check on things I saw:

INFO: task df:3147 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
df D 0000000000000002 0 3147 2901 0x00000084
ffff8800961abce8 0000000000000086 ffff8800961abc88 ffffffff81051439
ffff8800961abc78 0000000300000000 ffff8800961abc88 ffff880152e0d040
ffff880004dfe638 ffff8800961abfd8 000000000000fb88 ffff880004dfe638
Call Trace:
[<ffffffff81051439>] ? __wake_up_common+0x59/0x90
[<ffffffff8109708e>] ? prepare_to_wait+0x4e/0x80
[<ffffffffa02c9075>] fuse_request_send+0xe5/0x290 [fuse]
[<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
[<ffffffffa02d1c0a>] fuse_statfs+0xea/0x160 [fuse]
[<ffffffff811b3ec4>] statfs_by_dentry+0x74/0xa0
[<ffffffff811b3ffb>] vfs_statfs+0x1b/0xb0
[<ffffffff811b42c0>] do_statfs_native+0x20/0xb0
[<ffffffff811b4452>] sys_statfs+0x72/0xb0
[<ffffffff811813e2>] ? vfs_write+0x132/0x1a0
[<ffffffff810dc937>] ? audit_syscall_entry+0x1d7/0x200
[<ffffffff810dc685>] ? __audit_syscall_exit+0x265/0x290
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
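The hung-task detector fires for processes stuck in uninterruptible sleep (state D), as `df` is here. A generic way to spot such tasks while the hang is in progress (a standard diagnostic, not something from the original report) is:

```shell
# List tasks in uninterruptible sleep (state "D"), the state the
# hung-task detector warns about; wchan shows the kernel function
# each task is blocked in (e.g. a fuse request wait).
ps -eo pid,stat,wchan:32,comm | awk 'NR == 1 || $2 ~ /^D/'
```

For a fuse hang like this one, the wchan column of the D-state tasks typically points back into the fuse request path seen in the backtraces.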
Shanks/Ben Turner,

Could you provide a statedump of the hung GlusterFS native mount process? Along with this, it would be good to have the log files as well. Would it be possible to check whether the issue is also seen on other RHEL versions, say 6.4?

Ben, did you see this behaviour with the same RHEL/kernel/fuse versions as Shanks?
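For reference, a client-side statedump is taken by sending SIGUSR1 to the glusterfs mount process. A sketch (the mount path here is an example, and the dump directory can vary by version and configuration):

```shell
# Example mount point; substitute the actual fuse mount.
MOUNT=/mnt/shanks-vol

# Find the glusterfs client process serving the mount and ask it to
# dump its state (SIGUSR1 triggers a statedump in GlusterFS).
PID=$(pgrep -f "glusterfs.*${MOUNT}" | head -n 1)
kill -USR1 "${PID}"

# The dump file is named glusterdump.<pid>.dump.<timestamp>; look
# under /var/run/gluster (or /tmp on some older versions).
ls /var/run/gluster/glusterdump."${PID}".dump.* 2>/dev/null
```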
Created attachment 821038 [details] statedump of the GlusterFS native mount process
Shanks,

Can you also provide statedumps of the brick and quotad processes? This looks like an issue in the code path where a brick queries quotad for the aggregated size of a directory while enforcing quota limits.

regards,
Raghavendra.
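The requested dumps can be generated along these lines (volume name taken from this report; the brick statedump command runs on a server node):

```shell
# Statedumps of all brick processes of the volume:
gluster volume statedump shanks-quota all

# Statedump of the quotad process, again via SIGUSR1:
kill -USR1 "$(pidof quotad)"
```

Brick statedumps land in the server's statedump directory (configurable via the server.statedump-path volume option).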
I have seen this on volumes even without quota enabled.
Created attachment 821450 [details] gluster volume statedump shanks-quota all
Created attachment 821451 [details] kill -USR1 <pidof quotad>
Created attachment 822324 [details] gluster volume statedump shanks-vol all
Created attachment 822329 [details] rhs-new-files-brick3.29861.dump.1384162483
RHEL 6.4 client:

Nov 11 11:06:52 localhost kernel: INFO: task nautilus:2674 blocked for more than 120 seconds.
Nov 11 11:06:52 localhost kernel: Not tainted 2.6.32-430.el6.x86_64 #1
Nov 11 11:06:52 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 11 11:06:52 localhost kernel: nautilus D 0000000000000001 0 2674 16615 0x00000080
Nov 11 11:06:52 localhost kernel: ffff880212579e38 0000000000000086 0000000000000000 ffffffff81054839
Nov 11 11:06:52 localhost kernel: ffff88020daad728 0000000300000001 00000000000000d0 ffff88021560ac40
Nov 11 11:06:52 localhost kernel: ffff88020db5e638 ffff880212579fd8 000000000000fbc8 ffff88020db5e638
Nov 11 11:06:52 localhost kernel: Call Trace:
Nov 11 11:06:52 localhost kernel: [<ffffffff81054839>] ? __wake_up_common+0x59/0x90
Nov 11 11:06:52 localhost kernel: [<ffffffffa00e0085>] fuse_request_send+0xe5/0x290 [fuse]
Nov 11 11:06:52 localhost kernel: [<ffffffff8109b2a0>] ? autoremove_wake_function+0x0/0x40
Nov 11 11:06:52 localhost kernel: [<ffffffffa00e5ff6>] fuse_flush+0x106/0x140 [fuse]
Nov 11 11:06:52 localhost kernel: [<ffffffff81185b6c>] filp_close+0x3c/0x90
Nov 11 11:06:52 localhost kernel: [<ffffffff81185c65>] sys_close+0xa5/0x100
Nov 11 11:06:52 localhost kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

Version: glusterfs-3.4.0.42rhs-1.el6_4.x86_64
I could not reproduce this on a volume where quota is not enabled, so I am updating the bug summary to reflect this.
I cannot reproduce this with the reproducer steps that were given in the mail (reproduced below). I've followed the steps as closely as possible.

-------- The steps given ---------
1. Create a distributed-replicate volume.
2. Enable quota, set deem-statfs to on, and set limits.

[root@server1 ~]# gluster vol quota shanks-vol list
Path                 Hard-limit   Soft-limit   Used      Available
--------------------------------------------------------------------------------
/                    200.0GB      80%          137.6GB   62.4GB
/shanks/Downloads    175.0GB      80%          137.6GB   37.4GB

3. Mount shanks-vol on /home in a RHEL 6.4 client.
4. In the RHEL 6.4 client, do "useradd shanks", "passwd shanks".
5. Launch the GUI and navigate to the "Downloads" directory using nautilus.
6. Download ISOs of 2.5G and over from http://rhsqe-repo.lab.eng.blr.redhat.com/iso/windows/
7. Fill the "Downloads" directory with ~30G of data. (I downloaded the same file two or three times.)
8. Using nautilus, CTRL+a -> CTRL+c -> CTRL+v in the same "Downloads" directory.
9. You should see the hang when the copying is at around ~5GB, followed by backtraces in /var/log/messages.
--------- end of the steps -----------

-------- My test -----------
I used glusterfs-3.4.0.40rhs-2.el6rhs rpms. I created a 2x2 dist-rep volume over 4 nodes, enabled quota on it, set deem-statfs, and set the limits.
[root@minion2 ~]# gluster volume info test

Volume Name: test
Type: Distributed-Replicate
Volume ID: b85a03e4-d8dd-4704-99d4-3db251535390
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: minion2.usersys.redhat.com:/brick/test
Brick2: minion3.usersys.redhat.com:/brick/test
Brick3: minion4.usersys.redhat.com:/brick/test
Brick4: minion5.usersys.redhat.com:/brick/test
Options Reconfigured:
features.quota-deem-statfs: on
features.quota: on

[root@minion2 ~]# gluster volume quota test list
Path          Hard-limit   Soft-limit   Used     Available
--------------------------------------------------------------------------------
/             100.0GB      80%          47.4GB   52.6GB
/Downloads    75.0GB       80%          47.4GB   27.6GB

To test the exact steps given, I created a RHEL 6.4 workstation VM with 4GB of RAM and 2 cores, and installed RHEL 6.4 client rpms of version glusterfs-3.4.0.40rhs-1.el6_4 on it. I mounted the volume on /mnt and copied some ISOs onto it to create an initial dataset of ~25GB. Then I launched nautilus, navigated to /mnt/Downloads, and did Ctrl-a, Ctrl-c, Ctrl-v to create a duplicate copy of the data. All the files were duplicated successfully. I then tried duplicating the resulting ~50GB of data, which also worked until the quota limit was hit. I didn't get hangs at 5GB as given in the steps.

I didn't do the additional useradd steps and the /home mount as given in the steps. That may have caused the issue, but I'm pretty sure it isn't the case.
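The volume and quota configuration above could be recreated roughly as follows (a sketch using the host and brick names from the `volume info` output; exact CLI syntax can differ between gluster versions):

```shell
# 2x2 distributed-replicate volume over four nodes.
gluster volume create test replica 2 \
    minion2.usersys.redhat.com:/brick/test \
    minion3.usersys.redhat.com:/brick/test \
    minion4.usersys.redhat.com:/brick/test \
    minion5.usersys.redhat.com:/brick/test
gluster volume start test

# Quota setup matching "Options Reconfigured" and the quota list above.
gluster volume quota test enable
gluster volume set test features.quota-deem-statfs on
gluster volume quota test limit-usage / 100GB
gluster volume quota test limit-usage /Downloads 75GB
```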
Don't see the hang or backtraces any more. Good catch! :)

Verified: glusterfs-server-3.4.0.44rhs-1.el6rhs.x86_64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1769.html