Bug 764743 (GLUSTER-3011) - Uninterruptible processes writing (reading?) to/from glusterfs share
Summary: Uninterruptible processes writing (reading?) to/from glusterfs share
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: GLUSTER-3011
Product: GlusterFS
Classification: Community
Component: quick-read
Version: 3.2.0
Hardware: x86_64
OS: Linux
urgent
high
Target Milestone: ---
Assignee: Raghavendra G
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-06-09 12:54 UTC by Matus
Modified: 2018-11-29 12:00 UTC
CC List: 21 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-03-29 19:28:23 UTC
Regression: ---
Mount Type: fuse
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
dump of gluster client on server with hung apache processes, io-cache on, stat-prefetch on, io-thread-count=16 (597.42 KB, application/octet-stream) - 2011-06-14 11:26 UTC, Jiri Lunacek
dump of gluster client on server with hung apache processes, io-cache off, stat-prefetch off, io-thread-count=64 (142.53 KB, application/octet-stream) - 2011-06-14 11:27 UTC, Jiri Lunacek
Proposed fix (1022 bytes, application/octet-stream) - 2011-06-20 06:50 UTC, Raghavendra G
Patch to aid debugging (24.27 KB, application/octet-stream) - 2011-06-20 06:53 UTC, Raghavendra G
Volume logfile on system freeze (83.14 KB, text/plain) - 2011-07-02 14:08 UTC, Andreas Klein
program to reproduce the bug (1.43 KB, text/x-chdr) - 2011-07-04 06:12 UTC, Raghavendra Bhat
program to reproduce the bug (9.99 KB, text/x-csrc) - 2011-07-04 06:12 UTC, Raghavendra Bhat
USR1 dump after freeze but before kill -9 (61.98 KB, application/octet-stream) - 2012-03-14 09:44 UTC, yu.valery+bugzilla
USR1 dump after freeze and after issuing kill -9 (62.37 KB, application/octet-stream) - 2012-03-14 09:45 UTC, yu.valery+bugzilla
dump of gluster client on server with hung nginx processes (119.39 KB, text/plain) - 2018-11-29 08:28 UTC, kiwi

Description Matus 2011-06-09 12:54:50 UTC
Hi, 

sometimes we have hanging, uninterruptible processes on some of our client servers ("ps aux" shows them in state "D").

I am not able to kill such processes - even "kill -9" doesn't work - the only solution is to reboot the machine.

Glusterfs itself keeps working fine; I can still see and work with the shared files (even after a process freezes), only the affected process (in this case php-fpm) is completely frozen.

Kernel is 2.6.34, gluster 3.2.0, fuse 2.8.5. There is nothing in dmesg or the logs.

The machine is a brand new installation; the share is mounted with -t glusterfs.
The installation is standard, based on the 3.2 documentation, with no special patches/hacks or exotic configuration.

An otherwise identical (1:1) installation without gluster works 100% ok.

thanks

Matus

Comment 1 Pranith Kumar K 2011-06-10 05:29:31 UTC
hi,
   Would it be possible for you to give a specific test case that leads to the hang, so that we can try to reproduce this issue in-house? Please do a kill -USR1 <glusterfs, glusterfsd pids> when the hang happens and attach the resulting dumps to this bug.

Thanks for raising the issue.
Pranith
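
For reference, a minimal shell sketch of capturing the requested statedumps (it assumes the default dump location /tmp/glusterdump.<pid> mentioned later in this bug; adjust to your installation):

# ask every gluster client and brick daemon for a state dump
for pid in $(pidof glusterfs glusterfsd); do
    kill -USR1 "$pid"            # writes /tmp/glusterdump.<pid>
done
sleep 2
bzip2 -9 /tmp/glusterdump.*      # compress before attaching to the bug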

Comment 2 Matus 2011-06-10 06:22:17 UTC
Hi

I'm not able to reproduce it on demand; it happens about once a week, and my test PHP script doing massive read/write operations did not trigger it. Somebody told me it could be related to io-cache and that I should try turning it off, but I don't know how ...

Matus

Comment 3 Christopher 2011-06-10 08:56:07 UTC
http://gluster.org/pipermail/gluster-users/2011-June/007916.html
http://gluster.org/pipermail/gluster-users/2011-June/007918.html


Disabling io-cache and upgrading fuse did not solve the problem.

Comment 4 Anand Avati 2011-06-11 06:49:39 UTC
Can you post the process state dump output (kill -USR1 <glusterfsd pid>, then bzip2 -9 the dump file /tmp/glusterdump.<pid>) at the time a process is hung? Also post the pid of the hanging process.

(In reply to comment #3)
> http://gluster.org/pipermail/gluster-users/2011-June/007916.html
> http://gluster.org/pipermail/gluster-users/2011-June/007918.html
> 
> 
> Disabling io-cache and upgrading fuse did not solve the problem.

Comment 5 markus fröhlich 2011-06-14 05:00:16 UTC
here you can find the process state dump from "kill -USR1 <gfs-pid>":
http://www.xidras.com/logfiles/glusterdump.4027.bz2

it's 22 MB, but the upload limit is 1 MB, so I put it at the URL above.
This is the dump from the backup server that runs rsync.

Comment 6 Matus 2011-06-14 06:48:10 UTC
Hi, 

so it happened again.
I tried

kill -USR1 <glusterfs pid>

which killed glusterfs but produced no output in the gluster log, system log or console

- php-fpm was still frozen

then I did

kill -USR1 <glusterd pid>

which did nothing, not even terminate glusterd.
After I did kill -9 <glusterd pid>, glusterd disappeared and php-fpm could finally be killed - after which everything worked fine again

Matus

Comment 7 Jiri Lunacek 2011-06-14 11:26:10 UTC
Created attachment 515 [details]
dump of gluster client on server with hung apache processes, io-cache on, stat-prefetch on, io-thread-count=16

Comment 8 Jiri Lunacek 2011-06-14 11:27:59 UTC
Created attachment 516 [details]
dump of gluster client on server with hung apache processes, io-cache off, stat-prefetch off, io-thread-count=64

Comment 9 Jiri Lunacek 2011-06-14 11:35:35 UTC
Same behaviour here. As I noted in the gluster-users list, setting io-cache off does not eliminate the issue, it just narrows down the number of cases the issue shows.

Way to reproduce:
Gluster replica 2 tcp setup
CentOS 5.6 client
2.6.18-238.9.1.el5
fuse-2.7.4-8.el5
glusterfs-fuse-3.2.1-1
glusterfs-core-3.2.1-1
httpd-2.2.17-135

httpd virtual host with files on the gluster-fuse mounted volume.


run: for i in {1..1000}; do echo "http://server.address.com/file/hosted/from/gluster.jpg"; done | xargs -L1 -P20 wget -O/dev/null

This results in several httpd processes hung in ioctl "sync_page" (unkillable).

If you have any trouble reproducing the issue, I can prepare a VPS for tests of this issue in our cluster with ssh access.

Comment 10 Anand Avati 2011-06-15 04:06:30 UTC
The process state dump shows that there is a missing frame in qr_readv.

As a work-around, you can re-enable the stat-prefetch and io-cache translators, and disable the quick-read translator instead.

Raghu, please look into this.

Jiri, can you confirm whether the workaround was successful for you?

Avati
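
In gluster CLI terms, the suggested workaround amounts to something like the following sketch (VOLNAME is a placeholder; the option names are the performance.* keys quoted elsewhere in this report):

# disable quick-read, keep io-cache and stat-prefetch enabled
gluster volume set VOLNAME performance.quick-read off
gluster volume set VOLNAME performance.io-cache on
gluster volume set VOLNAME performance.stat-prefetch on
# verify the reconfigured options
gluster volume info VOLNAME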

Comment 11 Jiri Lunacek 2011-06-15 05:02:34 UTC
I can confirm that disabling quick-read and re-enabling io-cache and stat-prefetch removed the problem completely.

Thank you for the work-around. Good work. I am looking forward to the fix. 

Write back if you need any more info.

(In reply to comment #10)
> The process state dump shows that there is a missing frame in qr_readv.
> 
> As a work-around, you can re-enable the stat-prefetch and io-cache translators,
> and disable the quick-read translator instead.
> 
> Raghu, please look into this.
> 
> Jiri, can you confirm whether the workaround was successful for you?
> 
> Avati

Comment 12 Matus 2011-06-15 05:59:58 UTC
How can I disable "quick-read"? I can't find it anywhere in the gluster 3.2 documentation, and there is nothing about it among the gluster volume set options either ...

Is there a big performance penalty with this feature turned off?

Comment 13 Anand Avati 2011-06-15 06:28:21 UTC
(In reply to comment #12)
> How can I disable "quick-read"? I can't find it anywhere in the gluster 3.2
> documentation, and there is nothing about it among the gluster volume set
> options either ...
> 
> Is there a big performance penalty with this feature turned off?

gluster volume set <volname> performance.quick-read off

Comment 14 Jiri Lunacek 2011-06-15 06:41:47 UTC
(In reply to comment #12)
> How can I disable "quick-read"? I can't find it anywhere in the gluster 3.2
> documentation, and there is nothing about it among the gluster volume set options either ...

I used:
gluster volume set <volume name> quick-read off

> Is there a big performance penalty with this feature turned off?

We use a 1 Gbps TCP interconnect sharing millions of small files.
We have not encountered any performance degradation after turning this translator off.
Surprisingly (since it avoided the bug), it even led to a performance boost for our webserver.

Comment 15 markus fröhlich 2011-06-15 11:02:41 UTC
I can also confirm that disabling the quick-read translator fixed the hanging processes. On our backup servers it also looks good so far - maybe it's the workaround for the rsync hangs as well.

great! :)

Comment 16 Raghavendra G 2011-06-20 06:48:31 UTC
(In reply to comment #9)
> Same behaviour here. As I noted in the gluster-users list, setting io-cache off
> does not eliminate the issue, it just narrows down the number of cases the
> issue shows.
> 
> Way to reproduce:
> Gluster replica 2 tcp setup
> CentOS 5.6 client
> 2.6.18-238.9.1.el5
> fuse-2.7.4-8.el5
> glusterfs-fuse-3.2.1-1
> glusterfs-core-3.2.1-1
> httpd-2.2.17-135
> 
> httpd virtual host with files on the gluster-fuse mounted volume.
> 
> 
> run: for i in {1..1000}; do echo
> "http://server.address.com/file/hosted/from/gluster.jpg"; done | xargs -L1 -P20
> wget -O/dev/null
> 
> This results in several httpd processes hung in ioctl "sync_page" (unkillable).
> 
> If you have any trouble reproducing the issue, I can prepare a VPS for tests of
> this issue in our cluster with ssh access.

We are not able to reproduce the issue locally. Can you apply the two attached patches and let us know the results? If the issue is still not fixed, can you get a process statedump of glusterfs? You can use the following command to get the process statedump:

# kill -SIGUSR1 <glusterfs-client-pid>

The statedump can be found in /tmp/glusterdump.<glusterfs-client-pid>

regards,
Raghavendra

Comment 17 Raghavendra G 2011-06-20 06:50:01 UTC
Created attachment 525 [details]
Proposed fix

Comment 18 Raghavendra G 2011-06-20 06:53:11 UTC
Created attachment 526 [details]
Patch to aid debugging

Comment 19 Raghavendra G 2011-06-20 06:54:28 UTC
Please make sure you've loaded the quick-read translator too. Also, please get the process state dump only after you find that some processes accessing the mount point are in 'D' state.
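
As a quick way to spot such processes before taking the dump, something like this should work on most Linux systems (a sketch; filter further by command name, e.g. httpd or php-fpm, as needed):

# list processes currently in uninterruptible sleep ('D' state)
ps -eo pid,stat,comm,wchan | awk '$2 ~ /^D/'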

Comment 20 Matus 2011-06-20 13:45:43 UTC
Is the patch for the client or for the server (or both)?

Comment 21 Raghavendra G 2011-06-20 14:19:01 UTC
It's for the client.

Comment 22 Anand Avati 2011-06-22 12:40:28 UTC
PATCH: http://patches.gluster.com/patch/7556 in master (performance/quick-read: reset open_in_transit to zero in case of an error.)

Comment 23 Anand Avati 2011-06-22 12:40:51 UTC
PATCH: http://patches.gluster.com/patch/7580 in master (performance/quick-read: Perform error handling only when GF_CALLOC fails)

Comment 24 Anand Avati 2011-06-22 12:42:13 UTC
PATCH: http://patches.gluster.com/patch/7583 in release-3.2 (performance/quick-read: Perform error handling only when GF_CALLOC fails)

Comment 25 Anand Avati 2011-06-22 12:42:21 UTC
PATCH: http://patches.gluster.com/patch/7584 in release-3.2 (performance/quick-read: reset open_in_transit to zero in case of an error.)

Comment 26 Andreas Klein 2011-07-02 09:09:20 UTC
The problem persists even with quick-read off and these settings:

performance.io-thread-count: 64
performance.io-cache: on
performance.cache-size: 64MB
performance.stat-prefetch: on
performance.quick-read: off
performance.write-behind-window-size: 2097152

When logged in via ssh, a system freeze can sometimes be prevented by stopping Apache. top shows the glusterfs process for the www volume with high load.

Surprisingly, the logs for the volume and the bricks are empty around the freeze time.

In the background, Gluster is obviously still running; the log contains quite a few errors.

Comment 27 Vijay Bellur 2011-07-02 09:54:09 UTC
It would help if you could provide us with the process state dump from the glusterfs process(es) when you observe this hang. The process state dump can be obtained by sending SIGUSR1 to glusterfs (# kill -USR1 <glusterfs_pid>). This results in a file called /tmp/glusterdump.<glusterfs_pid>. We would need that dump file for further analysis.

Comment 28 Andreas Klein 2011-07-02 09:58:45 UTC
I would like to help you, but...

if this happens, one can observe a rapidly increasing load that dramatically slows the system down until it no longer responds to any user input or action.

When I happen to be logged in, I am busy saving the production system. Killing Gluster would require a reboot and, in most cases, an extensive rebuild of the MySQL master-master replication with the second server.

When not logged in, I have no chance at all... by the time the Nagios alert messages arrive, everything is gone and only a hard reset by the provider (sometimes a repeated reset is necessary) will help.

Comment 29 Vijay Bellur 2011-07-02 10:58:51 UTC
Can you please attach the logs from the client and servers after a hang is seen?

Comment 30 Andreas Klein 2011-07-02 14:05:40 UTC
There are only two servers which are also clients.

Comment 31 Andreas Klein 2011-07-02 14:08:51 UTC
Created attachment 539 [details]
Volume logfile on system freeze

Comment 32 Andreas Klein 2011-07-02 17:06:34 UTC
The effect just happened again; by stopping Apache I could prevent a crash.

In the Apache volume log there are no entries; the only timely relevant entries are in the mail volume logfile.

Nevertheless, I also have comparable log entries in the www volume log:

[2011-07-02 21:59:41.840356] W [inode.c:1035:inode_path] 0-vmail/inode: no dentry for non-root inode 2683152: c8d295ae-1687-47cb-baa9-421201e6d50a
[2011-07-02 21:59:41.840411] W [fuse-bridge.c:508:fuse_getattr] 0-glusterfs-fuse: 7312449: GETATTR 140388767689200 (fuse_loc_fill() failed)

And this entry is repeated many times.

Comment 33 Andreas Klein 2011-07-03 06:56:54 UTC
Yesterday and today, I made some further observations which might help...

GlusterFS is used for server-to-server file system synchronisation.

At first, I had the impression that GlusterFS had some sort of internal problem that caused Apache prefork processes to freeze in state D.

In such a case, system load suddenly increased to huge values (100-300), making the system laggy and unusable, mostly requiring a hard reset.

io-cache was on (switching it off did not improve the situation), quick-read was off, stat-prefetch was on, io-thread-count was set to 64, cache-size to 64 MB and write-behind-window-size to 2097152.

The system ran smoothly until I put GlusterFS into production and ran Apache on the Gluster partition (before that, I only rsynced onto it).

Yesterday, I could save the server several times by shutting down Apache just in time when system load was increasing.

I added rules to Nagios to perform exactly this shutdown when high load is detected.

The high load came from the glusterfs and glusterfsd processes (still running) and from Apache processes in state D.

Further investigation showed that the high load always occurred when the number of connections on the second IP (used for a clamav mirror, which currently sees an unacceptable 12-15 GB/h of traffic) rose from the normal overall 400-500 to 2000-4500 on the mirror's port 80. When Apache was stopped, the overall number of connections also dropped back to a normal level.

Based on that, the Nagios problem handling was extended to block port 80 on that IP with an iptables rule when the number of connections becomes unacceptable.

Since then, Nagios has occasionally reported the increase in connections, and the port on the troublesome IP gets closed.

System load has remained acceptable since then, and the high load values have not occurred anymore.

My assumption:

The huge number of connections caused problems for GlusterFS/FUSE while syncing with the second server and keeping the volume consistent. The requests of the accessing Apache processes could not be completed by Gluster/FUSE and the processes froze in state D. The increasing number of requests (from present and new users) caused Apache to create further processes, which also froze. Since, until yesterday, no proper measures were in place to cut down the number of connections, the available system resources were eventually exhausted and the system became laggy and froze (busy with itself, waiting for the file system to sync).

For GlusterFS, this problem is critical but not easy to solve. If this is the real cause, it would require implementing QoS between the servers to prioritize Gluster communication over other network traffic. This can be achieved in a small server setup with dedicated network cards for the internal network, but not with hosted root servers, possibly at different hosting providers.

Closing the port or even shutting down Apache is not the preferred solution, but it is always better than a system freeze requiring a manual reset.
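
As an illustration only, the kind of emergency block described above might look roughly like the following (the IP address and the connection threshold are placeholders, not values taken from this report):

# count established connections to port 80 on the mirror IP (placeholder address)
CONNS=$(netstat -tn | awk '$4 == "203.0.113.10:80" && $6 == "ESTABLISHED"' | wc -l)
# if the count is unacceptable, drop further traffic to that IP:port
if [ "$CONNS" -gt 2000 ]; then
    iptables -I INPUT -d 203.0.113.10 -p tcp --dport 80 -j DROP
fi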

Comment 34 Raghavendra Bhat 2011-07-04 06:12:04 UTC
Created attachment 540 [details]
program to reproduce the bug

Comment 35 Raghavendra Bhat 2011-07-04 06:12:49 UTC
Created attachment 541 [details]
program to reproduce the bug

Comment 36 Raghavendra Bhat 2011-07-04 06:15:59 UTC
The program in the last two attachments reproduces the bug if executed simultaneously from two glusterfs mounts (sometimes it reproduces on the very first iteration, sometimes it has to be run 2-3 times). The way to check is that the application hangs (cannot be killed with Ctrl+C), and the statedump of the client process where the application is hung indicates the call is stuck in quick-read.

Ran the same program 2-3 times for an hour and did not get the application hang after the patch. Moving it to resolved state. Please reopen if found again.

Comment 37 Anand Avati 2011-07-12 11:43:22 UTC
PATCH: http://patches.gluster.com/patch/7558 in release-3.0 (performance/quick-read: reset open_in_transit to zero in case of an error.)

Comment 38 Anand Avati 2011-07-12 11:43:34 UTC
PATCH: http://patches.gluster.com/patch/7557 in release-3.1 (performance/quick-read: reset open_in_transit to zero in case of an error.)

Comment 39 Anand Avati 2011-07-29 05:28:07 UTC
CHANGE: http://review.gluster.com/51 (Change-Id: I7a1e2cae3de8794b252ebbf0de7ffab5ba2900d1) merged in release-3.1 by Anand Avati (avati)

Comment 40 zaterio 2011-10-19 15:54:56 UTC
(In reply to comment #33)
> Yesterday and today, I made some further observations, which might help...
> 
> GlusterFS is used for server-server-file system synchronisation.
> 
> From the first point of view, I had the impression, that GlusterFS has some
> sort of internal problem and causes Apache prefork processes to freeze in state
> D.
> 
> In such a case, system load increased suddenly to huge values (100-300), making
> the system laggy and unuseable, mostly requiring a hard reset. 
> 
> io-cache was on (swichting to off did not improve the situation), quick-read
> was off, stat-prefetch was on, io-thread-count set to 64, cache-size is 64 MB
> and write-behind-widow-size is 2097152.
> 
> The system run smooth until I took the GlusterFS productive and run Apache on
> the Gluster partion (before, I did only a rsync onto it). 
> 
> Yesterday, I could save the server several times by shutting down Apache just
> in time when system load was increasing.
> 
> I added rules to Nagios to perform exactly this shutdown in case of high load
> detection. 
> 
> The high load resulted from glusterfs and glusterfsd processed (still running)
> and Apache processed running and in state D.
> 
> Further investigations showed, that the high load occured always when the
> number of system connections on the second IP (used for clamav mirror having
> presently an overall inacceptable traffic of 12-15 GB/h) raised from normal
> overall 400-500 to 2000-4500 on the mirror port 80. When stopping Apache, the
> number of overall connetions also dropped again to a normal level.
> 
> With that result, Nagios problem handling was further extended to block port 80
> on that IP in case of inacceptable number of connections by an iptables rule.
> 
> Since then, Nagios reported every now and then the increase of connections and
> the port on the trouble IP is closed. 
> 
> System load remained since then acceptable and the high load values did not
> occur anymore.
> 
> My assumption:
> 
> The huge number of connections caused problems to GlusterFS/Fuse while syncing
> with the second server and ensuring a consistent file volume. The accessing
> Apache process request could not be completed by Gluster/Fuse and the process
> freezed in state D. The increasing number of requests (by present and new
> users) caused Apache to create further processes which also freezed. As until
> yesterday, no proper measures were implemented to cut down the number of
> connections, the available system ressource were finally eaten up and the
> system became laggy and freezed (busy with itself waiting for file system
> sync).
> 
> For GlusterFS, this problem is critical, but not easy solveable. If this is the
> real reason, it would require to implement QoS between the servers to
> prioritize Gluster communication over other network traffic. This can be
> achieved in a small server setup with dedicated network cards for the internal
> network, but not with hosted root servers possibly at different hosters.
> 
> Closing the port or even shutting down Apache is not the preferred solution,
> but always better than a system freeze with manual reset request.


I suffered the same problem as Andreas: a large number of connections to Apache (4000 to 6000) made a two-brick replicated volume unusable; confirmed on 2 x 24-core servers with 64 GB RAM. High load, increased number of Apache threads.

Comment 41 chendequan 2011-10-31 10:34:00 UTC
Has this been fixed in 3.2.4?

I got the same problem in 3.2.4. Here is my tech info:

OS:
CentOS Linux release 6.0 (Final) - Virtual machine on SUSE Linux Enterprise Server 11 (x86_64)

Here is the dmesg log:
Oct 31 18:23:55 ct-testing-adam-dr01 kernel: INFO: task glusterfs:1725 blocked for more than 120 seconds.
Oct 31 18:23:55 ct-testing-adam-dr01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 31 18:23:55 ct-testing-adam-dr01 kernel: glusterfs     D ffff880109c23280     0  1725      1 0x00000000
Oct 31 18:23:55 ct-testing-adam-dr01 kernel: ffff8800ee2879b8 0000000000000086 0000000000000000 ffff8800ee287958
Oct 31 18:23:55 ct-testing-adam-dr01 kernel: 800000000e928025 ffffea00025d5948 000000000000000e 0000000110fbcd2c
Oct 31 18:23:55 ct-testing-adam-dr01 kernel: ffff8800ed143028 ffff8800ee287fd8 0000000000010518 ffff8800ed143028
Oct 31 18:23:55 ct-testing-adam-dr01 kernel: Call Trace:
Oct 31 18:23:55 ct-testing-adam-dr01 kernel: [<ffffffff8110c060>] ? sync_page+0x0/0x50
Oct 31 18:23:55 ct-testing-adam-dr01 kernel: [<ffffffff814c8a23>] io_schedule+0x73/0xc0
Oct 31 18:23:55 ct-testing-adam-dr01 kernel: [<ffffffff8110c09d>] sync_page+0x3d/0x50
Oct 31 18:23:55 ct-testing-adam-dr01 kernel: [<ffffffff814c914a>] __wait_on_bit_lock+0x5a/0xc0
Oct 31 18:23:55 ct-testing-adam-dr01 kernel: [<ffffffff8110c037>] __lock_page+0x67/0x70
Oct 31 18:23:55 ct-testing-adam-dr01 kernel: [<ffffffff81091ce0>] ? wake_bit_function+0x0/0x50
Oct 31 18:23:55 ct-testing-adam-dr01 kernel: [<ffffffff81121dc5>] ? put_page+0x25/0x40
Oct 31 18:23:55 ct-testing-adam-dr01 kernel: [<ffffffff8115b730>] lock_page+0x30/0x40
Oct 31 18:23:55 ct-testing-adam-dr01 kernel: [<ffffffff8115bdad>] migrate_pages+0x59d/0x5d0
Oct 31 18:23:55 ct-testing-adam-dr01 kernel: [<ffffffff81152470>] ? compaction_alloc+0x0/0x370
Oct 31 18:23:55 ct-testing-adam-dr01 kernel: [<ffffffff81151f1c>] compact_zone+0x4ac/0x5e0
Oct 31 18:23:55 ct-testing-adam-dr01 kernel: [<ffffffff8111cd1c>] ? get_page_from_freelist+0x15c/0x820
Oct 31 18:23:55 ct-testing-adam-dr01 kernel: [<ffffffff811522ce>] compact_zone_order+0x7e/0xb0
Oct 31 18:23:55 ct-testing-adam-dr01 kernel: [<ffffffff81152409>] try_to_compact_pages+0x109/0x170
Oct 31 18:23:55 ct-testing-adam-dr01 kernel: [<ffffffff8111e62c>] __alloc_pages_nodemask+0x55c/0x810
Oct 31 18:23:55 ct-testing-adam-dr01 kernel: [<ffffffff81150374>] alloc_pages_vma+0x84/0x110
Oct 31 18:23:55 ct-testing-adam-dr01 kernel: [<ffffffff8113ef50>] ? anon_vma_prepare+0x30/0x160
Oct 31 18:23:55 ct-testing-adam-dr01 kernel: [<ffffffff811673b5>] do_huge_pmd_anonymous_page+0x135/0x360
Oct 31 18:23:55 ct-testing-adam-dr01 kernel: [<ffffffff81136455>] handle_mm_fault+0x245/0x2b0
Oct 31 18:23:55 ct-testing-adam-dr01 kernel: [<ffffffff814cd4d3>] do_page_fault+0x123/0x3a0
Oct 31 18:23:55 ct-testing-adam-dr01 kernel: [<ffffffff814caf45>] page_fault+0x25/0x30
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: INFO: task glusterfs:1725 blocked for more than 120 seconds.
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: glusterfs     D ffff880109c23280     0  1725      1 0x00000000
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: ffff8800ee2879b8 0000000000000086 0000000000000000 ffff8800ee287958
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: 800000000e928025 ffffea00025d5948 000000000000000e 0000000110fbcd2c
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: ffff8800ed143028 ffff8800ee287fd8 0000000000010518 ffff8800ed143028
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: Call Trace:
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff8110c060>] ? sync_page+0x0/0x50
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff814c8a23>] io_schedule+0x73/0xc0
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff8110c09d>] sync_page+0x3d/0x50
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff814c914a>] __wait_on_bit_lock+0x5a/0xc0
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff8110c037>] __lock_page+0x67/0x70
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff81091ce0>] ? wake_bit_function+0x0/0x50
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff81121dc5>] ? put_page+0x25/0x40
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff8115b730>] lock_page+0x30/0x40
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff8115bdad>] migrate_pages+0x59d/0x5d0
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff81152470>] ? compaction_alloc+0x0/0x370
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff81151f1c>] compact_zone+0x4ac/0x5e0
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff8111cd1c>] ? get_page_from_freelist+0x15c/0x820
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff811522ce>] compact_zone_order+0x7e/0xb0
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff81152409>] try_to_compact_pages+0x109/0x170
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff8111e62c>] __alloc_pages_nodemask+0x55c/0x810
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff81150374>] alloc_pages_vma+0x84/0x110
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff8113ef50>] ? anon_vma_prepare+0x30/0x160
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff811673b5>] do_huge_pmd_anonymous_page+0x135/0x360
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff81136455>] handle_mm_fault+0x245/0x2b0
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff814cd4d3>] do_page_fault+0x123/0x3a0
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff814caf45>] page_fault+0x25/0x30
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: INFO: task glusterfs:1728 blocked for more than 120 seconds.
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: glusterfs     D ffff880109c23680     0  1728      1 0x00000000
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: ffff8800ed981df0 0000000000000086 0000000000000000 0000000000000000
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: 0000000000000000 ffff8800319b2940 ffff880104c2eab0 0000000110fc1b1e
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: ffff8800ed7ee6b8 ffff8800ed981fd8 0000000000010518 ffff8800ed7ee6b8
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: Call Trace:
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff814c8286>] ? thread_return+0x4e/0x778
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff81095da3>] ? __hrtimer_start_range_ns+0x1a3/0x430
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff814ca6b5>] rwsem_down_failed_common+0x95/0x1d0
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff810960af>] ? hrtimer_try_to_cancel+0x3f/0xd0
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff814ca846>] rwsem_down_read_failed+0x26/0x30
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff81264224>] call_rwsem_down_read_failed+0x14/0x30
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff814c9d44>] ? down_read+0x24/0x30
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff814cd6fa>] do_page_fault+0x34a/0x3a0
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff814caf45>] page_fault+0x25/0x30
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: INFO: task glusterfs:1731 blocked for more than 120 seconds.
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: glusterfs     D ffff880109c23280     0  1731      1 0x00000000
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: ffff8800ee8efe08 0000000000000086 0000000000000000 ffffffff812223f5
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: ffff8800ee71c480 ffffffff00000001 ffff880000000004 0000000110fc0170
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: ffff8800ed13da58 ffff8800ee8effd8 0000000000010518 ffff8800ed13da58
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: Call Trace:
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff812223f5>] ? process_measurement+0xc5/0xf0
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff814ca6b5>] rwsem_down_failed_common+0x95/0x1d0
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff814ca813>] rwsem_down_write_failed+0x23/0x30
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff81264253>] call_rwsem_down_write_failed+0x13/0x20
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff814c9d12>] ? down_write+0x32/0x40
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff8112c03c>] sys_mmap_pgoff+0x5c/0x2a0
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff81018129>] sys_mmap+0x29/0x30
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff81013172>] system_call_fastpath+0x16/0x1b
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: INFO: task pgrep:17342 blocked for more than 120 seconds.
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: pgrep         D ffff880109c23480     0 17342  17341 0x00000000
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: ffff880017cc9ca0 0000000000000086 0000000000000000 ffff880017cc9c18
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: ffffffff811d08dc ffff880017cc9c28 ffffffff8112d14d 0000000110fc17c6
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: ffff8800441505f8 ffff880017cc9fd8 0000000000010518 ffff8800441505f8
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: Call Trace:
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff811d08dc>] ? task_dumpable+0x3c/0x60
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff8112d14d>] ? zone_statistics+0x7d/0xa0
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff8117a47d>] ? do_lookup+0x7d/0x220
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff814ca6b5>] rwsem_down_failed_common+0x95/0x1d0
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff8117b179>] ? __link_path_walk+0x729/0x1040
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff814ca846>] rwsem_down_read_failed+0x26/0x30
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff81264224>] call_rwsem_down_read_failed+0x14/0x30
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff814c9d44>] ? down_read+0x24/0x30
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff811369cc>] access_process_vm+0x4c/0x200
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff81222440>] ? ima_file_check+0x20/0x30
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff8117d32d>] ? do_filp_open+0x60d/0xd40
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff811d0cfd>] proc_pid_cmdline+0x6d/0x120
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff811502a7>] ? alloc_pages_current+0x87/0xd0
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff811d1e9d>] proc_info_read+0xad/0xf0
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff8116d085>] vfs_read+0xb5/0x1a0
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff8116d1c1>] sys_read+0x51/0x90
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff81013172>] system_call_fastpath+0x16/0x1b
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: INFO: task ps:17374 blocked for more than 120 seconds.
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: ps            D ffff880109c23680     0 17374      1 0x00000004
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: ffff88004cd7fca0 0000000000000082 0000000000000000 ffff88004cd7fc18
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: ffffffff811d08dc ffff88004cd7fc28 ffffffff8112d14d 0000000110fd46af
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: ffff8800412f7a98 ffff88004cd7ffd8 0000000000010518 ffff8800412f7a98
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: Call Trace:
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff811d08dc>] ? task_dumpable+0x3c/0x60
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff8112d14d>] ? zone_statistics+0x7d/0xa0
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff8117a47d>] ? do_lookup+0x7d/0x220
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff814ca6b5>] rwsem_down_failed_common+0x95/0x1d0
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff8117b179>] ? __link_path_walk+0x729/0x1040
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff814ca846>] rwsem_down_read_failed+0x26/0x30
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff81264224>] call_rwsem_down_read_failed+0x14/0x30
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff814c9d44>] ? down_read+0x24/0x30
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff811369cc>] access_process_vm+0x4c/0x200
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff81222440>] ? ima_file_check+0x20/0x30
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff8117d32d>] ? do_filp_open+0x60d/0xd40
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff811d0cfd>] proc_pid_cmdline+0x6d/0x120
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff811502a7>] ? alloc_pages_current+0x87/0xd0
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff811d1e9d>] proc_info_read+0xad/0xf0
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff8116d085>] vfs_read+0xb5/0x1a0
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff8116d1c1>] sys_read+0x51/0x90
Oct 31 18:25:54 ct-testing-adam-dr01 kernel: [<ffffffff81013172>] system_call_fastpath+0x16/0x1b
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: INFO: task glusterfs:1725 blocked for more than 120 seconds.
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: glusterfs     D ffff880109c23280     0  1725      1 0x00000000
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: ffff8800ee2879b8 0000000000000086 0000000000000000 ffff8800ee287958
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: 800000000e928025 ffffea00025d5948 000000000000000e 0000000110fbcd2c
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: ffff8800ed143028 ffff8800ee287fd8 0000000000010518 ffff8800ed143028
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: Call Trace:
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff8110c060>] ? sync_page+0x0/0x50
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff814c8a23>] io_schedule+0x73/0xc0
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff8110c09d>] sync_page+0x3d/0x50
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff814c914a>] __wait_on_bit_lock+0x5a/0xc0
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff8110c037>] __lock_page+0x67/0x70
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff81091ce0>] ? wake_bit_function+0x0/0x50
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff81121dc5>] ? put_page+0x25/0x40
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff8115b730>] lock_page+0x30/0x40
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff8115bdad>] migrate_pages+0x59d/0x5d0
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff81152470>] ? compaction_alloc+0x0/0x370
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff81151f1c>] compact_zone+0x4ac/0x5e0
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff8111cd1c>] ? get_page_from_freelist+0x15c/0x820
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff811522ce>] compact_zone_order+0x7e/0xb0
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff81152409>] try_to_compact_pages+0x109/0x170
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff8111e62c>] __alloc_pages_nodemask+0x55c/0x810
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff81150374>] alloc_pages_vma+0x84/0x110
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff8113ef50>] ? anon_vma_prepare+0x30/0x160
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff811673b5>] do_huge_pmd_anonymous_page+0x135/0x360
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff81136455>] handle_mm_fault+0x245/0x2b0
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff814cd4d3>] do_page_fault+0x123/0x3a0
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff814caf45>] page_fault+0x25/0x30
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: INFO: task glusterfs:1728 blocked for more than 120 seconds.
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: glusterfs     D ffff880109c23680     0  1728      1 0x00000000
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: ffff8800ed981df0 0000000000000086 0000000000000000 0000000000000000
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: 0000000000000000 ffff8800319b2940 ffff880104c2eab0 0000000110fc1b1e
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: ffff8800ed7ee6b8 ffff8800ed981fd8 0000000000010518 ffff8800ed7ee6b8
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: Call Trace:
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff814c8286>] ? thread_return+0x4e/0x778
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff81095da3>] ? __hrtimer_start_range_ns+0x1a3/0x430
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff814ca6b5>] rwsem_down_failed_common+0x95/0x1d0
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff810960af>] ? hrtimer_try_to_cancel+0x3f/0xd0
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff814ca846>] rwsem_down_read_failed+0x26/0x30
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff81264224>] call_rwsem_down_read_failed+0x14/0x30
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff814c9d44>] ? down_read+0x24/0x30
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff814cd6fa>] do_page_fault+0x34a/0x3a0
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff814caf45>] page_fault+0x25/0x30
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: INFO: task glusterfs:1731 blocked for more than 120 seconds.
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: glusterfs     D ffff880109c23280     0  1731      1 0x00000000
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: ffff8800ee8efe08 0000000000000086 0000000000000000 ffffffff812223f5
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: ffff8800ee71c480 ffffffff00000001 ffff880000000004 0000000110fc0170
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: ffff8800ed13da58 ffff8800ee8effd8 0000000000010518 ffff8800ed13da58
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: Call Trace:
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff812223f5>] ? process_measurement+0xc5/0xf0
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff814ca6b5>] rwsem_down_failed_common+0x95/0x1d0
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff814ca813>] rwsem_down_write_failed+0x23/0x30
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff81264253>] call_rwsem_down_write_failed+0x13/0x20
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff814c9d12>] ? down_write+0x32/0x40
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff8112c03c>] sys_mmap_pgoff+0x5c/0x2a0
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff81018129>] sys_mmap+0x29/0x30
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff81013172>] system_call_fastpath+0x16/0x1b
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: INFO: task pgrep:17342 blocked for more than 120 seconds.
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: pgrep         D ffff880109c23480     0 17342  17341 0x00000000
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: ffff880017cc9ca0 0000000000000086 0000000000000000 ffff880017cc9c18
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: ffffffff811d08dc ffff880017cc9c28 ffffffff8112d14d 0000000110fc17c6
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: ffff8800441505f8 ffff880017cc9fd8 0000000000010518 ffff8800441505f8
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: Call Trace:
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff811d08dc>] ? task_dumpable+0x3c/0x60
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff8112d14d>] ? zone_statistics+0x7d/0xa0
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff8117a47d>] ? do_lookup+0x7d/0x220
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff814ca6b5>] rwsem_down_failed_common+0x95/0x1d0
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff8117b179>] ? __link_path_walk+0x729/0x1040
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff814ca846>] rwsem_down_read_failed+0x26/0x30
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff81264224>] call_rwsem_down_read_failed+0x14/0x30
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff814c9d44>] ? down_read+0x24/0x30
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff811369cc>] access_process_vm+0x4c/0x200
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff81222440>] ? ima_file_check+0x20/0x30
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff8117d32d>] ? do_filp_open+0x60d/0xd40
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff811d0cfd>] proc_pid_cmdline+0x6d/0x120
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff811502a7>] ? alloc_pages_current+0x87/0xd0
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff811d1e9d>] proc_info_read+0xad/0xf0
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff8116d085>] vfs_read+0xb5/0x1a0
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff8116d1c1>] sys_read+0x51/0x90
Oct 31 18:27:52 ct-testing-adam-dr01 kernel: [<ffffffff81013172>] system_call_fastpath+0x16/0x1b
Oct 31 18:45:47 ct-testing-adam-dr01 GlusterFS[18301]: [2011-10-31 18:45:47.231943] C [rdma.c:3942:rdma_init] 0-rpc-transport/rdma: No IB devices found

Comment 42 Andreas Klein 2011-10-31 14:09:22 UTC
With some surprise, I see "VERIFIED FIXED".

My last trial in September with the then-current versions of Gluster (3.2.4 and the 3.3 beta) ran into the same catastrophic problems; Gluster caused a complete crash of two newly installed and set-up openSUSE servers.

For that reason, I dropped Gluster completely and switched to DRBD/OCFS2.

With the desired result: the system has been running smoothly and reliably since then.

Comment 43 Christopher 2011-11-02 04:33:21 UTC
(In reply to comment #42)
> With some surprise, I see "VERIFIED FIXED".
> 
> My last trial in September with the then-current versions of Gluster (3.2.4
> and the 3.3 beta) ran into the same catastrophic problems; Gluster caused a
> complete crash of two newly installed and set-up openSUSE servers.
> 
> For that reason, I dropped Gluster completely and switched to DRBD/OCFS2.
> 
> With the desired result: the system has been running smoothly and reliably since then.



We had the same problems!
Gluster is not usable in production...

Comment 44 Joe Julian 2011-11-04 15:13:16 UTC
Could this be the same as bug 764964?

Comment 45 Joe Julian 2011-11-04 15:21:59 UTC
Andreas Klein - Though you didn't offer anything useful in your bug comment, if I extrapolate, it looks like what you're trying to say is that you no longer need bug updates. You can remove yourself from the CC list by clicking "edit", highlighting your email address, checking the "Remove selected CCs" box, and then clicking "Save Changes".

Christopher - Along the same line, if by "not ready for production" you're asking for the importance to be raised to "blocker", then ask for that.

Comment 46 Philip P 2012-02-01 14:26:28 UTC
We run gluster 3.2.5 (recently updated from 3.2.1) with the servers on squeeze without openvz (2.6.32-5-amd64) and the clients on the same kernel with -openvz, or on 2.6.32-bpo.5-openvz-amd64 on lenny. On all clients that are heavily used (i.e. nginx for image delivery) we see processes hang completely (until a system reboot) after a day or two. Mounting with NFS seems to delay the first occurrence of the problem to up to a week or more.

The files are concurrently accessed by other mounts (image saving, resizing etc.) that are not affected.

We use a distributed-replicated setup with 12 bricks on four machines, with tcp transport.

Options Reconfigured:
performance.stat-prefetch: on
performance.io-cache: on
performance.io-thread-count: 64
performance.cache-size: 64MB
performance.quick-read: off

echo t > /proc/sysrq-trigger was used after a hang to produce several variations of the following pattern in dmesg:

[154432.301769] nginx         D ffff88080a609000     0   918    916 0x00000004
[154432.301811]  ffff88083d486000 0000000000000086 0000000000000000 ffff88083bafca80
[154432.301873]  ffff88083ba27c00 ffff8805d6a113c0 000000000000fa40 ffff8805e2db1fd8
[154432.301936]  0000000000016940 0000000000016940 ffff88080a609000 ffff88080a6092f8
[154432.301998] Call Trace:
[154432.302026]  [<ffffffff812ea34b>] ? __mutex_lock_common+0x122/0x192
[154432.302062]  [<ffffffff812ea473>] ? mutex_lock+0x1a/0x31
[154432.302096]  [<ffffffff810f8bf5>] ? do_lookup+0xa0/0x178
[154432.302130]  [<ffffffff810f96df>] ? __link_path_walk+0x689/0x811
[154432.302165]  [<ffffffff810f99ef>] ? path_walk+0x44/0x85
[154432.302199]  [<ffffffff810fad0f>] ? do_path_lookup+0x20/0x77
[154432.302234]  [<ffffffff810fc053>] ? user_path_at+0x48/0x79
[154432.302268]  [<ffffffff8122e775>] ? sys_recvfrom+0xba/0x120
[154432.302302]  [<ffffffff810f45dc>] ? vfs_fstatat+0x2c/0x57
[154432.302336]  [<ffffffff810f46cf>] ? sys_newstat+0x11/0x30
[154432.302371]  [<ffffffff8104a322>] ? default_wake_function+0x0/0x9
[154432.302406]  [<ffffffff81010c12>] ? system_call_fastpath+0x16/0x1b

Interestingly, after the hang occurred, the glusterfs mount was still available and functional, with glusterfs in state S - it might have recovered. With 3.2.1, the glusterfs mount usually wasn't in a usable state after a hang - an ls would hang irrevocably like the nginx processes. It appears the update to 3.2.5 could be responsible for this particular change of behaviour.

While we love the flexibility of glusterfs, this bug is a deal breaker, as it either produces downtime on a weekly basis or requires ugly and maintenance-intensive workarounds, such as always having a second machine ready and rebooting the crashed one.

We would gladly be of assistance to hammer this bug out, as the next step for us would be to just throw it in the bin and buy a regular NAS.

Is there work being done? Does a bugfix exist that can be tried? A workaround? An older version that most assuredly does not have this problem?

Comment 47 Philip P 2012-02-01 19:24:20 UTC
Small update: after someone on IRC proposed trying nginx without sendfile, we did that, and the problem persists. Another hang within 12 hours of my last report. I created a dump from the glusterfs process which I will send to rgowdapp(at)redhat.com momentarily.

Comment 48 Jeff Darcy 2012-02-02 15:19:50 UTC
do_huge_pmd_anonymous_page

I see the following sequence in one of the traces:

 compaction_alloc+0x0/0x370
 compact_zone+0x4ac/0x5e0
 get_page_from_freelist+0x15c/0x820
 compact_zone_order+0x7e/0xb0
 try_to_compact_pages+0x109/0x170
 __alloc_pages_nodemask+0x55c/0x810
 alloc_pages_vma+0x84/0x110
 anon_vma_prepare+0x30/0x160
 do_huge_pmd_anonymous_page+0x135/0x360

This seems suspiciously similar to https://bugzilla.redhat.com/show_bug.cgi?id=764964

 [<ffffffff81152b20>] ? compaction_alloc+0x0/0x370
 [<ffffffff811525cc>] compact_zone+0x4cc/0x600
 [<ffffffff8111cffc>] ? get_page_from_freelist+0x15c/0x820
 [<ffffffff8115297e>] compact_zone_order+0x7e/0xb0
 [<ffffffff81152ab9>] try_to_compact_pages+0x109/0x170
 [<ffffffff8111e99d>] __alloc_pages_nodemask+0x5ed/0x850
 [<ffffffff810c6b88>] ? start_callback+0xb8/0xd0
 [<ffffffff810c6a35>] ? finish_callback+0xa5/0x140
 [<ffffffff810c8058>] ? finish_report+0x78/0xe0
 [<ffffffff81150db3>] alloc_pages_vma+0x93/0x150
 [<ffffffff81167f15>] do_huge_pmd_anonymous_page+0x135/0x340

It's not *exactly* the same, but it does make me wonder whether this problem might be similarly related to hugepages. Try turning them off (however that's done in ancient Debian), or perhaps try a newer kernel.
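
For reference, on kernels that expose transparent hugepages via sysfs, turning them off at runtime usually looks something like this (a sketch; the exact paths differ between distributions, and the change does not survive a reboot):

# disable transparent hugepages and their defragmentation at runtime
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
# confirm the setting (the active value is shown in brackets)
cat /sys/kernel/mm/transparent_hugepage/enabled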

Comment 49 Philip P 2012-02-13 05:07:21 UTC
After a week and a half of using NFS only on all clients, I am happy to report that NFS seems to be able to handle the concurrency level we need, which gluster-fuse cannot.

However, gluster's NFS implementation seems to have a memory leak or two and requires a restart every two weeks or so. Also, there is no failover server with NFS.

Comment 50 Damian Tylczyński 2012-02-17 17:43:50 UTC
Same problems as the guys above. Do we need

Comment 51 Jeff Darcy 2012-03-13 12:02:33 UTC
This bug hasn't been updated in nearly a month. Is anyone still seeing this problem *with hugepages turned off*? If not, I'm going to mark this as a duplicate of bug 764964 (or vice versa), so they can be tracked together.

Comment 52 yu.valery+bugzilla 2012-03-14 09:42:39 UTC
It seems that I have the same or similar problem in my very simple setup.
I'm using gluster as an NFS replacement with a single server/brick (home NAS), because of its better performance on large files and smaller protocol overhead.

Recently, I started experiencing lock-ups in one application accessing gluster (the ktorrent torrent client). After it "freezes" I can't kill it with kill -9; only a reboot helps. Interestingly, the mounted volume is still accessible and other apps can read it without problems. For instance, issuing "find" on the whole volume does not lead to a problem.

I tried to disable quick-read translator with no effect.

My software versions:
Server: kernel 2.6.39 armv5tel, glusterfs 3.2.4
Client: kernel 3.2.9 i686, glusterfs 3.2.4, fuse 2.8.5

I have made client state snapshots with kill -SIGUSR1 after the lock-up, both before issuing kill -9 and after (the kill -9 had no effect).

After issuing kill -9 the kernel starts repeatedly complaining that the app is frozen:
[35041.008261] INFO: task ktorrent:2891 blocked for more than 480 seconds.
[35041.008269] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[35041.008275] ktorrent        D c0abf060     0  2891      1 0x00000004
[35041.008286]  dff01f14 00200082 c0244388 c0abf060 c0b8a340 c0abf060 c0b8a340 c0b8a340
[35041.008304]  a6b9d6c8 00001f3e f4407340 dfd2a030 eeb305b0 00000000 00000000 00000003
[35041.008319]  dfc56034 dfc56030 00000003 00000001 dff01f04 c023116c 00000000 00000000
[35041.008335] Call Trace:
[35041.008373]  [<f9b816b5>] request_wait_answer+0xb5/0x1f0 [fuse]
[35041.008405]  [<f9b81861>] fuse_request_send+0x71/0xa0 [fuse]
[35041.008426]  [<f9b8926c>] fuse_flush+0xdc/0x100 [fuse]
[35041.008485]  [<c032f23e>] filp_close+0x2e/0x80
[35041.008496]  [<c032f2fb>] sys_close+0x6b/0xc0
[35041.008507]  [<c072bb5d>] syscall_call+0x7/0xb
[35041.008536]  [<b66625d5>] 0xb66625d4

Comment 53 yu.valery+bugzilla 2012-03-14 09:44:28 UTC
Created attachment 569940 [details]
USR1 dump after freeze but before kill -9

USR1 dump after freeze but before kill -9 of ktorrent accessing gluster volume. Other apps can still access without problem.

Comment 54 yu.valery+bugzilla 2012-03-14 09:45:50 UTC
Created attachment 569941 [details]
USR1 dump after freeze and after issuing kill -9

USR1 dump after freeze and after kill -9 of ktorrent accessing gluster volume. Other apps can still access without problem. After this point kernel starts complaining about frozen task. Nothing can help it but reboot.

Comment 55 Jeff Darcy 2012-03-23 18:46:49 UTC
Seems to me that write-behind is pretty strongly implicated here.  Turning that off might help.  Also *were hugepages turned off*?
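
For anyone who wants to try that, the volume-level switch would presumably follow the same pattern as the other performance options in this report (VOLNAME is a placeholder):

# turn the write-behind translator off for the volume
gluster volume set VOLNAME performance.write-behind off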

Comment 56 yu.valery+bugzilla 2012-03-25 16:38:15 UTC
Hi,

After updating kernel and glusterfs client I cannot reproduce the locking situation (might be because the load pattern is different now). I have a feeling that my problem was related to https://bugzilla.redhat.com/show_bug.cgi?id=GLUSTER-3679 .

BTW, I have automatic hugepages enabled in kernel 3.3.0 and have seen no problems connected with that so far.

Anyway, I take my report off the table (since I cannot reproduce it anymore).

Comment 57 Jeff Darcy 2012-03-26 13:19:15 UTC
Thank you, yu.valery.  It looks like we have three different issues that have (rather unfortunately) been associated with this bug.

(1) Something to do with large numbers of threads (up to about comment 40), which hasn't been mentioned since October of last year.

(2) A likely dup of bug 764964 (e.g. comment 41).

(3) A possible dup of bug 765411 (comments 52 onward).

While it would be nice to have a more certain resolution of the first problem, without more information we can't even know whether it still exists in code which has undergone significant change in the last five months.  Is there anything left to justify leaving this bug open?

Comment 58 Jeff Darcy 2012-03-29 19:28:23 UTC
The patches in comments 37-39 seem to have fixed the first symptom, and the other two are duplicates.  If the first symptom reappears, we can always reopen.

Comment 59 Andreas T. 2015-12-22 12:53:58 UTC
# rpm -qa | grep gluster
glusterfs-api-3.6.3-1.el6.x86_64
glusterfs-cli-3.6.3-1.el6.x86_64
glusterfs-fuse-3.6.3-1.el6.x86_64
glusterfs-3.6.3-1.el6.x86_64
glusterfs-server-3.6.3-1.el6.x86_64
glusterfs-libs-3.6.3-1.el6.x86_64

Same issue as described above, though with nginx. Stopping the brick on one of the glusterfs peers makes everything go back to normal, but if it is started again the problem persists.

I have two replicated bricks and use glusterfs for images.

Let me know if you need me to perform any troubleshooting steps.

Comment 60 kiwi 2018-11-29 08:28:14 UTC
Created attachment 1509754 [details]
dump of gluster client on server with hung nginx processes

Comment 61 kiwi 2018-11-29 08:47:21 UTC
I am facing the same issue.
The glusterfs version is glusterfs 4.1.5.
I tried resetting all the custom configuration and disabling performance.quick-read,
performance.io-cache,
performance.stat-prefetch,

and transparent huge pages.

But when running this test:
for i in {1..10000}; do echo '--header "Host:www.example.com" http://127.0.0.1/uploads/2017/11/941c1a4c3fb.jpg'; done | xargs -L1 -P20 wget -O/dev/null

the client glusterfs process CPU goes to 100% and then grows to 200%, and the nginx processes end up in D state and freeze.
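
To double-check that those options really ended up disabled before re-running the test, something along these lines should be enough (a sketch; VOLNAME is a placeholder):

# show the reconfigured options on the volume
gluster volume info VOLNAME
# confirm transparent hugepages are off on the client (the active value is shown in brackets)
cat /sys/kernel/mm/transparent_hugepage/enabled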

