Bug 1293742 - glusterfsd crashes with page allocation failures
Summary: glusterfsd crashes with page allocation failures
Keywords:
Status: CLOSED DUPLICATE of bug 1269702
Alias: None
Product: GlusterFS
Classification: Community
Component: core
Version: 3.7.6
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2015-12-22 22:32 UTC by patrick.glomski
Modified: 2016-05-31 12:32 UTC (History)

Fixed In Version: glusterfs 3.7.7
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-05-31 12:32:36 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
core dump 1 (672.15 KB, application/x-gzip)
2015-12-22 22:32 UTC, patrick.glomski
core dump 2 (666.92 KB, application/x-gzip)
2015-12-22 22:33 UTC, patrick.glomski
core dump 3 (671.61 KB, application/x-gzip)
2015-12-22 22:33 UTC, patrick.glomski
core dump 4 (1.23 MB, application/x-gzip)
2015-12-22 22:34 UTC, patrick.glomski
dmesg from one of the peers (32.80 KB, text/plain)
2015-12-22 22:35 UTC, patrick.glomski

Description patrick.glomski 2015-12-22 22:32:35 UTC
Created attachment 1108715
core dump 1

Description of problem:

We recently upgraded from gluster 3.6.6 to 3.7.6 (the 3.7.6-1 RPMs hosted on download.gluster.org) and have started seeing page allocation failures in dmesg. It appears that glusterfsd now sometimes fills up the cache completely and crashes with a page allocation failure. All hosts run Scientific Linux 6.6 (kernel 2.6.32-573.12.1.el6.x86_64) and are connected via InfiniBand (IP over IB). The errors occur consistently on two separate gluster pools.

Version-Release number of selected component (if applicable):
glusterfs 3.7.6 built on Nov  9 2015 15:19:41

Additional info:

I will attach core dumps.

Volume info:

Volume Name: gfsbackup
Type: Distribute
Volume ID: e78d5123-d9bc-4d88-9c73-61d28abf0b41
Status: Started
Number of Bricks: 7
Transport-type: tcp
Bricks:
Brick1: gfsib01bkp.corvidtec.com:/data/brick01bkp/gfsbackup
Brick2: gfsib01bkp.corvidtec.com:/data/brick02bkp/gfsbackup
Brick3: gfsib02bkp.corvidtec.com:/data/brick01bkp/gfsbackup
Brick4: gfsib02bkp.corvidtec.com:/data/brick02bkp/gfsbackup
Brick5: gfsib02bkp.corvidtec.com:/data/brick03bkp/gfsbackup
Brick6: gfsib02bkp.corvidtec.com:/data/brick04bkp/gfsbackup
Brick7: gfsib02bkp.corvidtec.com:/data/brick05bkp/gfsbackup

Stack trace from the dmesg:
[1458118.134697] glusterfsd: page allocation failure. order:5, mode:0x20
[1458118.134701] Pid: 6010, comm: glusterfsd Not tainted 2.6.32-573.3.1.el6.x86_64 #1
[1458118.134702] Call Trace:
[1458118.134714]  [<ffffffff8113770c>] ? __alloc_pages_nodemask+0x7dc/0x950
[1458118.134728]  [<ffffffffa0321800>] ? mlx4_ib_post_send+0x680/0x1f90 [mlx4_ib]
[1458118.134733]  [<ffffffff81176e92>] ? kmem_getpages+0x62/0x170
[1458118.134735]  [<ffffffff81177aaa>] ? fallback_alloc+0x1ba/0x270
[1458118.134736]  [<ffffffff811774ff>] ? cache_grow+0x2cf/0x320
[1458118.134738]  [<ffffffff81177829>] ? ____cache_alloc_node+0x99/0x160
[1458118.134743]  [<ffffffff8145f732>] ? pskb_expand_head+0x62/0x280
[1458118.134744]  [<ffffffff81178479>] ? __kmalloc+0x199/0x230
[1458118.134746]  [<ffffffff8145f732>] ? pskb_expand_head+0x62/0x280
[1458118.134748]  [<ffffffff8146001a>] ? __pskb_pull_tail+0x2aa/0x360
[1458118.134751]  [<ffffffff8146f389>] ? harmonize_features+0x29/0x70
[1458118.134753]  [<ffffffff8146f9f4>] ? dev_hard_start_xmit+0x1c4/0x490
[1458118.134758]  [<ffffffff8148cf8a>] ? sch_direct_xmit+0x15a/0x1c0
[1458118.134759]  [<ffffffff8146ff68>] ? dev_queue_xmit+0x228/0x320
[1458118.134762]  [<ffffffff8147665d>] ? neigh_connected_output+0xbd/0x100
[1458118.134766]  [<ffffffff814abc67>] ? ip_finish_output+0x287/0x360
[1458118.134767]  [<ffffffff814abdf8>] ? ip_output+0xb8/0xc0
[1458118.134769]  [<ffffffff814ab04f>] ? __ip_local_out+0x9f/0xb0
[1458118.134770]  [<ffffffff814ab085>] ? ip_local_out+0x25/0x30
[1458118.134772]  [<ffffffff814ab580>] ? ip_queue_xmit+0x190/0x420
[1458118.134773]  [<ffffffff81137059>] ? __alloc_pages_nodemask+0x129/0x950
[1458118.134776]  [<ffffffff814c0c54>] ? tcp_transmit_skb+0x4b4/0x8b0
[1458118.134778]  [<ffffffff814c319a>] ? tcp_write_xmit+0x1da/0xa90
[1458118.134779]  [<ffffffff81178cbd>] ? __kmalloc_node+0x4d/0x60
[1458118.134780]  [<ffffffff814c3a80>] ? tcp_push_one+0x30/0x40
[1458118.134782]  [<ffffffff814b410c>] ? tcp_sendmsg+0x9cc/0xa20
[1458118.134786]  [<ffffffff8145836b>] ? sock_aio_write+0x19b/0x1c0
[1458118.134788]  [<ffffffff814581d0>] ? sock_aio_write+0x0/0x1c0
[1458118.134791]  [<ffffffff8119169b>] ? do_sync_readv_writev+0xfb/0x140
[1458118.134797]  [<ffffffff810a14b0>] ? autoremove_wake_function+0x0/0x40
[1458118.134801]  [<ffffffff8123e92f>] ? selinux_file_permission+0xbf/0x150
[1458118.134804]  [<ffffffff812316d6>] ? security_file_permission+0x16/0x20
[1458118.134806]  [<ffffffff81192746>] ? do_readv_writev+0xd6/0x1f0
[1458118.134807]  [<ffffffff811928a6>] ? vfs_writev+0x46/0x60
[1458118.134809]  [<ffffffff811929d1>] ? sys_writev+0x51/0xd0
[1458118.134812]  [<ffffffff810e88ae>] ? __audit_syscall_exit+0x25e/0x290
[1458118.134816]  [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b
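
For readers triaging similar reports, the first line of the trace above can be decoded mechanically. The following is a minimal sketch, not part of the original report. It assumes 4 KiB pages (the x86_64 default) and the GFP bit values from the 2.6.32 kernel's include/linux/gfp.h; the names `decode_failure`, `GFP_FLAGS` and `FAILURE_RE` are hypothetical helpers introduced for illustration.

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB pages, the x86_64 default

# Assumption: bit values as defined in the 2.6.32 kernel's include/linux/gfp.h.
GFP_WAIT = 0x10
GFP_FLAGS = {
    0x10: "__GFP_WAIT",
    0x20: "__GFP_HIGH",
    0x40: "__GFP_IO",
    0x80: "__GFP_FS",
}

# Matches the header line the kernel prints before the call trace.
FAILURE_RE = re.compile(
    r"(?P<comm>\S+): page allocation failure\. "
    r"order:(?P<order>\d+), mode:(?P<mode>0x[0-9a-fA-F]+)"
)


def decode_failure(line):
    """Return a dict describing the failed allocation, or None if no match."""
    match = FAILURE_RE.search(line)
    if match is None:
        return None
    order = int(match.group("order"))
    mode = int(match.group("mode"), 16)
    pages = 1 << order  # an order-n request asks for 2**n contiguous pages
    return {
        "comm": match.group("comm"),
        "order": order,
        "pages": pages,
        "bytes": pages * PAGE_SIZE,
        "flags": [name for bit, name in sorted(GFP_FLAGS.items()) if mode & bit],
        "can_sleep": bool(mode & GFP_WAIT),
    }


line = "[1458118.134697] glusterfsd: page allocation failure. order:5, mode:0x20"
info = decode_failure(line)
print(info)
```

Under those assumptions, order:5 is a request for 32 physically contiguous pages (128 KiB), and mode:0x20 carries only __GFP_HIGH, meaning the allocation was made in a context that cannot sleep to reclaim or compact memory. Requests like this can fail on a fragmented system even when plenty of memory is free overall.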

Comment 1 patrick.glomski 2015-12-22 22:33:11 UTC
Created attachment 1108716
core dump 2

Comment 2 patrick.glomski 2015-12-22 22:33:41 UTC
Created attachment 1108717
core dump 3

Comment 3 patrick.glomski 2015-12-22 22:34:14 UTC
Created attachment 1108718
core dump 4

Comment 4 patrick.glomski 2015-12-22 22:35:02 UTC
Created attachment 1108719
dmesg from one of the peers

Comment 5 Soumya Koduri 2016-01-05 12:20:34 UTC
As mentioned in the mail thread linked below, the fix is available in the 3.7.7 release:
http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/13281

Closing the bug
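
To check whether a given host is already at or past the release named above, the version banner quoted in the description can be compared against 3.7.7. This is a minimal sketch, not part of the original thread. It assumes the banner has the form "glusterfs X.Y.Z built on ..." as in the report; `banner_version` and `at_or_after_fix` are hypothetical helper names.

```python
import re

FIXED_IN = (3, 7, 7)  # release named in this bug as containing the fix


def banner_version(banner):
    """Extract the dotted version from a 'glusterfs X.Y.Z built on ...' line."""
    match = re.search(r"glusterfs (\d+(?:\.\d+)+)", banner)
    if match is None:
        raise ValueError("no glusterfs version found in: %r" % banner)
    return tuple(int(part) for part in match.group(1).split("."))


def at_or_after_fix(banner):
    """True if the banner's version is 3.7.7 or later (tuple comparison)."""
    return banner_version(banner) >= FIXED_IN


reported = "glusterfs 3.7.6 built on Nov  9 2015 15:19:41"
print(banner_version(reported), at_or_after_fix(reported))
```

The comparison is purely numeric. It says nothing about whether an older release such as 3.6.6 was affected, only whether the version is at or after the one carrying the fix.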

Comment 6 Niels de Vos 2016-05-31 12:32:36 UTC
Pranith confirmed the fix for 3.7.7 in this email:

  http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/13281/focus=13335

The patch http://review.gluster.org/12312 was merged for bug 1269702.

*** This bug has been marked as a duplicate of bug 1269702 ***

