Bug 842206
| Summary: | glusterfsd: page allocation failure | | |
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Saurabh <saujain> |
| Component: | core | Assignee: | Raghavendra Bhat <rabhat> |
| Status: | CLOSED CURRENTRELEASE | Severity: | high |
| Priority: | medium | Version: | pre-release |
| Hardware: | x86_64 | OS: | Linux |
| Fixed In Version: | glusterfs-3.4.0 | Doc Type: | Bug Fix |
| Last Closed: | 2013-07-24 17:17:57 UTC | Type: | Bug |
| Target Milestone: | --- | Target Release: | --- |
| Regression: | --- | Mount Type: | --- |
| CC: | amarts, dchinner, esandeen, gluster-bugs, mzywusko, ryszard.lach | | |
Description (Saurabh, 2012-07-23 05:13:27 UTC)
A similar backtrace I am finding with swift-account-server and swift-proxy-server page allocation failures. Also to be mentioned: /var/log/messages and /var/log/glusterfs/mnt-gluster-AUTH_test.log are not getting updated, even though requests are being sent to this same machine. On a similar note, I found some page allocation failures with a slight change in the backtrace:

    glusterfs: page allocation failure. order:1, mode:0x20
    Pid: 14544, comm: glusterfs Not tainted 2.6.32-220.23.1.el6.x86_64 #1
    Call Trace:
    <IRQ> [<ffffffff8112415f>] ? __alloc_pages_nodemask+0x77f/0x940 [<ffffffffa0154600>] ? start_xmit+0x30/0x1d0 [virtio_net] [<ffffffff8115e152>] ? kmem_getpages+0x62/0x170 [<ffffffff8115ed6a>] ? fallback_alloc+0x1ba/0x270 [<ffffffff8115eae9>] ? ____cache_alloc_node+0x99/0x160 [<ffffffff8115f8cb>] ? kmem_cache_alloc+0x11b/0x190 [<ffffffff8141fcf8>] ? sk_prot_alloc+0x48/0x1c0 [<ffffffff8141ff82>] ? sk_clone+0x22/0x2e0 [<ffffffff8146d256>] ? inet_csk_clone+0x16/0xd0 [<ffffffff81486143>] ? tcp_create_openreq_child+0x23/0x450 [<ffffffff81483b2d>] ? tcp_v4_syn_recv_sock+0x4d/0x2a0 [<ffffffff81485f01>] ? tcp_check_req+0x201/0x420 [<ffffffff8148354b>] ? tcp_v4_do_rcv+0x35b/0x430 [<ffffffffa0384557>] ? ipv4_confirm+0x87/0x1d0 [nf_conntrack_ipv4] [<ffffffff81484cc1>] ? tcp_v4_rcv+0x4e1/0x860 [<ffffffff81462940>] ? ip_local_deliver_finish+0x0/0x2d0 [<ffffffff81462a1d>] ? ip_local_deliver_finish+0xdd/0x2d0 [<ffffffff81462ca8>] ? ip_local_deliver+0x98/0xa0 [<ffffffff8146216d>] ? ip_rcv_finish+0x12d/0x440 [<ffffffff814626f5>] ? ip_rcv+0x275/0x350 [<ffffffff8142c6ab>] ? __netif_receive_skb+0x49b/0x6f0 [<ffffffff8142e768>] ? netif_receive_skb+0x58/0x60 [<ffffffffa01553ad>] ? virtnet_poll+0x5dd/0x8d0 [virtio_net] [<ffffffff81431013>] ? net_rx_action+0x103/0x2f0 [<ffffffff81072291>] ? __do_softirq+0xc1/0x1d0 [<ffffffff810958b0>] ? hrtimer_interrupt+0x140/0x250 [<ffffffff8100c24c>] ? call_softirq+0x1c/0x30 [<ffffffff8100de85>] ? do_softirq+0x65/0xa0 [<ffffffff81072075>] ? irq_exit+0x85/0x90 [<ffffffff814f5600>] ? smp_apic_timer_interrupt+0x70/0x9b [<ffffffff8100bc13>] ? apic_timer_interrupt+0x13/0x20 <EOI>

    glusterfs: page allocation failure. order:1, mode:0x20
    Pid: 14544, comm: glusterfs Not tainted 2.6.32-220.23.1.el6.x86_64 #1
    Call Trace:
    <IRQ> [<ffffffff8112415f>] ? __alloc_pages_nodemask+0x77f/0x940 [<ffffffff8115e152>] ? kmem_getpages+0x62/0x170 [<ffffffff8115ed6a>] ? fallback_alloc+0x1ba/0x270 [<ffffffff8115eae9>] ? ____cache_alloc_node+0x99/0x160 [<ffffffff8115f8cb>] ? kmem_cache_alloc+0x11b/0x190 [<ffffffff8141fcf8>] ? sk_prot_alloc+0x48/0x1c0 [<ffffffff8141ff82>] ? sk_clone+0x22/0x2e0 [<ffffffff8146d256>] ? inet_csk_clone+0x16/0xd0 [<ffffffff81486143>] ? tcp_create_openreq_child+0x23/0x450 [<ffffffff81483b2d>] ? tcp_v4_syn_recv_sock+0x4d/0x2a0 [<ffffffff81485f01>] ? tcp_check_req+0x201/0x420 [<ffffffff8148354b>] ? tcp_v4_do_rcv+0x35b/0x430 [<ffffffffa0384557>] ? ipv4_confirm+0x87/0x1d0 [nf_conntrack_ipv4] [<ffffffff81484cc1>] ? tcp_v4_rcv+0x4e1/0x860 [<ffffffff81462940>] ? ip_local_deliver_finish+0x0/0x2d0 [<ffffffff81462a1d>] ? ip_local_deliver_finish+0xdd/0x2d0 [<ffffffff81462ca8>] ? ip_local_deliver+0x98/0xa0 [<ffffffff8146216d>] ? ip_rcv_finish+0x12d/0x440 [<ffffffff814626f5>] ? ip_rcv+0x275/0x350 [<ffffffff8142c6ab>] ? __netif_receive_skb+0x49b/0x6f0 [<ffffffff8142e768>] ? netif_receive_skb+0x58/0x60 [<ffffffffa01553ad>] ? virtnet_poll+0x5dd/0x8d0 [virtio_net] [<ffffffff8142c6ab>] ? __netif_receive_skb+0x49b/0x6f0 [<ffffffff81431013>] ? net_rx_action+0x103/0x2f0 [<ffffffffa01541b9>] ? skb_recv_done+0x39/0x40 [virtio_net] [<ffffffff81072291>] ? __do_softirq+0xc1/0x1d0 [<ffffffff810d9740>] ? handle_IRQ_event+0x60/0x170 [<ffffffff8100c24c>] ? call_softirq+0x1c/0x30 [<ffffffff8100de85>] ? do_softirq+0x65/0xa0 [<ffffffff81072075>] ? irq_exit+0x85/0x90 [<ffffffff814f5515>] ? do_IRQ+0x75/0xf0 [<ffffffff8100ba53>] ? ret_from_intr+0x0/0x11 <EOI> [<ffffffff810dde01>] ? rcu_sched_qs+0x1/0x30 [<ffffffff814ece97>] ? schedule+0x47/0x3b2 [<ffffffff8103758c>] ? kvm_clock_read+0x1c/0x20 [<ffffffff81037599>] ? kvm_clock_get_cycles+0x9/0x10 [<ffffffff811777f6>] ? vfs_writev+0x46/0x60 [<ffffffff81177972>] ? sys_writev+0xa2/0xb0 [<ffffffff8100b16a>] ? sysret_careful+0x14/0x17

Comment 1

Output of 'free -m' will help.

Comment 2

Just by having a look at the stack trace, it doesn't look like something obvious with glusterfs. Will keep it open, and work on fixing some memory leak issues.

Comment 3

Will need to revisit after we fix some of the leaks.

Comment 4

I had some issues on the machine, so it was rebooted. I will try to reproduce it and update the results.

Comment 5

I have been able to reproduce the issue in a completely different setup. This time I have four hardware machines with 50GB RAM.

    [root@gqac028 ~]# glusterfs -V
    glusterfs 3.3.0 built on Jul 19 2012 14:08:45
    Repository revision: git://git.gluster.com/glusterfs.git
    Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
    GlusterFS comes with ABSOLUTELY NO WARRANTY.
    You may redistribute copies of GlusterFS under the terms of the GNU General Public License.
    [root@gqac028 ~]# rpm -qa | grep glusterfs
    glusterfs-devel-3.3.0-23.el6rhs.x86_64
    glusterfs-server-3.3.0-23.el6rhs.x86_64
    glusterfs-rdma-3.3.0-23.el6rhs.x86_64
    org.apache.hadoop.fs.glusterfs-glusterfs-0.20.2_0.2-1.noarch
    glusterfs-3.3.0-23.el6rhs.x86_64
    glusterfs-fuse-3.3.0-23.el6rhs.x86_64
    glusterfs-geo-replication-3.3.0-23.el6rhs.x86_64

    swift-container: page allocation failure. order:1, mode:0x20
    Pid: 4768, comm: swift-container Not tainted 2.6.32-220.23.1.el6.x86_64 #1
    Call Trace:
    <IRQ> [<ffffffff8112415f>] ? __alloc_pages_nodemask+0x77f/0x940 [<ffffffff8115e152>] ? kmem_getpages+0x62/0x170 [<ffffffff8115ed6a>] ? fallback_alloc+0x1ba/0x270 [<ffffffff8115eae9>] ? ____cache_alloc_node+0x99/0x160 [<ffffffff8115f8cb>] ? kmem_cache_alloc+0x11b/0x190 [<ffffffff8141fcf8>] ? sk_prot_alloc+0x48/0x1c0 [<ffffffff8141ff82>] ? sk_clone+0x22/0x2e0 [<ffffffff8146d256>] ? inet_csk_clone+0x16/0xd0 [<ffffffff81486143>] ? tcp_create_openreq_child+0x23/0x450 [<ffffffff81483b2d>] ? tcp_v4_syn_recv_sock+0x4d/0x2a0 [<ffffffff81485f01>] ? tcp_check_req+0x201/0x420 [<ffffffff8148354b>] ? tcp_v4_do_rcv+0x35b/0x430 [<ffffffffa031d557>] ? ipv4_confirm+0x87/0x1d0 [nf_conntrack_ipv4] [<ffffffff81484cc1>] ? tcp_v4_rcv+0x4e1/0x860 [<ffffffff81462940>] ? ip_local_deliver_finish+0x0/0x2d0 [<ffffffff81462a1d>] ? ip_local_deliver_finish+0xdd/0x2d0 [<ffffffff81462ca8>] ? ip_local_deliver+0x98/0xa0 [<ffffffff8146216d>] ? ip_rcv_finish+0x12d/0x440 [<ffffffff814626f5>] ? ip_rcv+0x275/0x350 [<ffffffff8142c6ab>] ? __netif_receive_skb+0x49b/0x6f0 [<ffffffff8142e768>] ? netif_receive_skb+0x58/0x60 [<ffffffff8142e870>] ? napi_skb_finish+0x50/0x70 [<ffffffff81430ef9>] ? napi_gro_receive+0x39/0x50 [<ffffffffa014ad4f>] ? bnx2_poll_work+0xd4f/0x1270 [bnx2] [<ffffffff81104f5b>] ? perf_pmu_enable+0x2b/0x40 [<ffffffff8110a808>] ? perf_event_task_tick+0xa8/0x2f0 [<ffffffffa014b2ad>] ? bnx2_poll_msix+0x3d/0xc0 [bnx2] [<ffffffff81431013>] ? net_rx_action+0x103/0x2f0 [<ffffffff81072291>] ? __do_softirq+0xc1/0x1d0 [<ffffffff810d9740>] ? handle_IRQ_event+0x60/0x170 [<ffffffff810722ea>] ? __do_softirq+0x11a/0x1d0 [<ffffffff8100c24c>] ? call_softirq+0x1c/0x30 [<ffffffff8100de85>] ? do_softirq+0x65/0xa0 [<ffffffff81072075>] ? irq_exit+0x85/0x90 [<ffffffff814f5515>] ? do_IRQ+0x75/0xf0 [<ffffffff8100ba53>] ? ret_from_intr+0x0/0x11 <EOI>

    swift-container: page allocation failure. order:1, mode:0x20
    Pid: 4768, comm: swift-container Not tainted 2.6.32-220.23.1.el6.x86_64 #1
    Call Trace:
    <IRQ> [<ffffffff8112415f>] ? __alloc_pages_nodemask+0x77f/0x940 [<ffffffff81053400>] ? select_idle_sibling+0x40/0x150 [<ffffffff8115e152>] ? kmem_getpages+0x62/0x170 [<ffffffff8115ed6a>] ? fallback_alloc+0x1ba/0x270 [<ffffffff8115eae9>] ? ____cache_alloc_node+0x99/0x160 [<ffffffff8115f8cb>] ? kmem_cache_alloc+0x11b/0x190 [<ffffffff8141fcf8>] ? sk_prot_alloc+0x48/0x1c0 [<ffffffff8141ff82>] ? sk_clone+0x22/0x2e0 [<ffffffff8146d256>] ? inet_csk_clone+0x16/0xd0 [<ffffffff81486143>] ? tcp_create_openreq_child+0x23/0x450 [<ffffffff81483b2d>] ? tcp_v4_syn_recv_sock+0x4d/0x2a0 [<ffffffff81485f01>] ? tcp_check_req+0x201/0x420 [<ffffffff8148354b>] ? tcp_v4_do_rcv+0x35b/0x430 [<ffffffffa031d557>] ? ipv4_confirm+0x87/0x1d0 [nf_conntrack_ipv4] [<ffffffff81484cc1>] ? tcp_v4_rcv+0x4e1/0x860 [<ffffffff81462940>] ? ip_local_deliver_finish+0x0/0x2d0 [<ffffffff81462a1d>] ? ip_local_deliver_finish+0xdd/0x2d0 [<ffffffff81462ca8>] ? ip_local_deliver+0x98/0xa0 [<ffffffff8146216d>] ? ip_rcv_finish+0x12d/0x440 [<ffffffff814626f5>] ? ip_rcv+0x275/0x350 [<ffffffff8142c6ab>] ? __netif_receive_skb+0x49b/0x6f0 [<ffffffff8142e768>] ? netif_receive_skb+0x58/0x60 [<ffffffff8142e870>] ? napi_skb_finish+0x50/0x70 [<ffffffff81430ef9>] ? napi_gro_receive+0x39/0x50 [<ffffffffa014ad4f>] ? bnx2_poll_work+0xd4f/0x1270 [bnx2] [<ffffffff81280110>] ? swiotlb_map_page+0x0/0x100 [<ffffffffa014b2ad>] ? bnx2_poll_msix+0x3d/0xc0 [bnx2] [<ffffffff810de937>] ? cpu_quiet_msk+0x77/0x130 [<ffffffff81431013>] ? net_rx_action+0x103/0x2f0 [<ffffffff81072291>] ? __do_softirq+0xc1/0x1d0 [<ffffffff810d9740>] ? handle_IRQ_event+0x60/0x170 [<ffffffff810722ea>] ? __do_softirq+0x11a/0x1d0 [<ffffffff8100c24c>] ? call_softirq+0x1c/0x30 [<ffffffff8100de85>] ? do_softirq+0x65/0xa0 [<ffffffff81072075>] ? irq_exit+0x85/0x90 [<ffffffff814f5515>] ? do_IRQ+0x75/0xf0 [<ffffffff8100ba53>] ? ret_from_intr+0x0/0x11 <EOI>

    [root@gqac028 ~]# free -m
                 total       used       free     shared    buffers     cached
    Mem:         48383      46827       1556          0        129      35999
    -/+ buffers/cache:      10697      37686
    Swap:        50431          0      50431

Comment 6

Hi. free -m is not enough for an explanation. Look at http://utcc.utoronto.ca/~cks/space/blog/linux/WhyPageAllocFailure and then you notice that the important lines are (from my logs):

    glusterfsd: page allocation failure.
    order:4, mode:0xc0d0
    (Pid: 14168, comm: glusterfsd Not tainted 2.6.32-5-xen-amd64 #1)

order:4 means that glusterfsd tried to allocate 2^4 * pageSize = 64kB of data.

[...]

    Node 0 DMA: 2*4kB 2*8kB 1*16kB 1*32kB 2*64kB 4*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 1*4096kB = 7880kB
    Node 0 DMA32: 15869*4kB 3667*8kB 614*16kB 40*32kB 21*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 106796kB
    Node 0 Normal: 8448*4kB 61*8kB 52*16kB 22*32kB 10*64kB 0*128kB 0*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 37480kB

Here I see that my system has 21 contiguous 64kB chunks in the DMA32 zone and 10 64kB chunks in the Normal zone, but the page allocation probably fails because most free memory sits in 4kB-sized chunks. So, in short, I suppose it is a memory fragmentation problem. I have no idea how to deal with it; I'll try to reduce the memory assigned to the OS (xen DomU) and see what happens. Cheers, R.

Comment 7

Hi, again. We've experimented with various vm settings (see http://community.gluster.org/a/linux-kernel-tuning-for-glusterfs/). It seems that the only parameter that changes anything is vm.vfs_cache_pressure. After increasing it to a huge value (10000) there are noticeably more contiguous pages available (hopefully we will not have problems with dentry/inode read performance). The page allocation error still happens, but not as frequently as before.

We've noticed one more interesting thing. We have a 4-brick setup (stripe + replicate). The page allocation failure happens almost every 5 minutes, at the same time, on almost all systems (1 brick = 1 OS). We also have a related error in the brick's log:

    [2012-08-03 09:48:36.898418] I [server3_1-fops.c:823:server_getxattr_cbk] 0-shared-server: 196890: GETXATTR / (system.posix_acl_access) ==> -1 (Cannot allocate memory)

We'll check if it is related to ACLs (we're mounting gluster via the native gluster client with the acl mount option). Does anybody have an idea why it happens every 5 minutes? R.

Comment 8

Hi. Remounting without ACL solved the problem. We now have 4 bricks, all without ACLs, with different kernel settings, and none of them has had a page allocation failure since the remount. R.

Comment 9

When you say 'remount without ACL', are you talking about -o acl for the gluster mount or for the backend filesystem? (which I assume is XFS)

Comment 10

Both.

Comment 11

"acl" is not a valid mount option for xfs:

    SELinux: initialized (dev sdb4, type xfs), uses xattr
    XFS (sdb4): unknown mount option [acl].

so I don't understand "both" - but maybe your backend isn't xfs?

Comment 12

All the stack traces point to the ethernet driver stack failing order 1 GFP_ATOMIC allocations during interrupt. I can't see a connection between the failures and the reported filesystem ACL solution...

Comment 13

Sorry, I didn't comment on your 'xfs' suggestion (did I mention xfs somewhere?). My FS is EXT4.

Dave: I'm trying to give you pure facts. Sixth day after the remount without failures. Before the remount: failures every day, almost every 5 minutes (during morning import jobs). Cheers, R.

Comment 14

I have a fresh setup: two replicated bricks + geo-replication from one of them to another machine. After the first geo-replication connection (and the start of replication of all files to the third machine) I have some page allocation failures (on the geo-replication master node):

    Aug 7 13:06:08 kernel: [445091.204793] Pid: 15480, comm: glusterfsd Not tainted 2.6.32-5-xen-amd64 #1
    Aug 7 13:06:08 kernel: [445091.204799] Call Trace:
    Aug 7 13:06:08 kernel: [445091.204812] [<ffffffff810bb986>] ? __alloc_pages_nodemask+0x59b/0x5fd
    Aug 7 13:06:08 kernel: [445091.204820] [<ffffffff810ba943>] ? __get_free_pages+0x9/0x46
    Aug 7 13:06:08 kernel: [445091.204828] [<ffffffff810e948d>] ? __kmalloc+0x3f/0x141
    Aug 7 13:06:08 kernel: [445091.204837] [<ffffffff81107d14>] ? getxattr+0x89/0x117
    Aug 7 13:06:08 kernel: [445091.204847] [<ffffffff8100ecdf>] ? xen_restore_fl_direct_end+0x0/0x1
    Aug 7 13:06:08 kernel: [445091.204854] [<ffffffff810e84e9>] ? kmem_cache_free+0x72/0xa3
    Aug 7 13:06:08 kernel: [445091.204862] [<ffffffff810fb1b4>] ? user_path_at+0x52/0x79
    Aug 7 13:06:08 kernel: [445091.204870] [<ffffffff8118f8cf>] ? _atomic_dec_and_lock+0x33/0x50
    Aug 7 13:06:08 kernel: [445091.204878] [<ffffffff81107e3d>] ? sys_lgetxattr+0x42/0x5c
    Aug 7 13:06:08 kernel: [445091.204885] [<ffffffff81011b42>] ? system_call_fastpath+0x16/0x1b
    Aug 7 13:06:08 kernel: [445091.204891] Mem-Info:
    Aug 7 13:06:08 kernel: [445091.204899] Node 0 DMA per-cpu:
    Aug 7 13:06:08 kernel: [445091.204905] CPU 0: hi: 0, btch: 1 usd: 0
    Aug 7 13:06:08 kernel: [445091.204910] Node 0 DMA32 per-cpu:
    Aug 7 13:06:08 kernel: [445091.204916] CPU 0: hi: 186, btch: 31 usd: 0
    Aug 7 13:06:08 kernel: [445091.204924] active_anon:52709 inactive_anon:96198 isolated_anon:24
    Aug 7 13:06:08 kernel: [445091.204925] active_file:65598 inactive_file:121637 isolated_file:22
    Aug 7 13:06:08 kernel: [445091.204927] unevictable:777 dirty:539 writeback:496 unstable:0
    Aug 7 13:06:08 kernel: [445091.204928] free:16038 slab_reclaimable:21131 slab_unreclaimable:4726
    Aug 7 13:06:08 kernel: [445091.204930] mapped:2461 shmem:3 pagetables:1169 bounce:0
    Aug 7 13:06:08 kernel: [445091.204948] Node 0 DMA free:6068kB min:40kB low:48kB high:60kB active_anon:0kB inactive_anon:176kB active_file:5616kB inactive_file:132kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:12824kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:732kB slab_unreclaimable:228kB kernel_stack:32kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
    Aug 7 13:06:08 kernel: [445091.204977] lowmem_reserve[]: 0 1499 1499 1499
    Aug 7 13:06:08 kernel: [445091.204987] Node 0 DMA32 free:58084kB min:4932kB low:6164kB high:7396kB active_anon:210836kB inactive_anon:384616kB active_file:256776kB inactive_file:486416kB unevictable:3108kB isolated(anon):96kB isolated(file):88kB present:1535200kB mlocked:3108kB dirty:2156kB writeback:1984kB mapped:9844kB shmem:12kB slab_reclaimable:83792kB slab_unreclaimable:18676kB kernel_stack:928kB pagetables:4676kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
    Aug 7 13:06:08 kernel: [445091.205018] lowmem_reserve[]: 0 0 0 0
    Aug 7 13:06:08 kernel: [445091.205028] Node 0 DMA: 45*4kB 60*8kB 50*16kB 16*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 6068kB
    Aug 7 13:06:08 kernel: [445091.205050] Node 0 DMA32: 13899*4kB 47*8kB 30*16kB 9*32kB 1*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 58084kB

and in gluster.log:

    [2012-08-07 13:06:08.036781] W [client3_1-fops.c:1059:client3_1_getxattr_cbk] 0-foto-client-1: remote operation failed: Cannot allocate memory. Path: /5/2613849468/3ea82d2b9e33fa7f809b6e2a3176ffc0/2613849468_5we.jpg (91986aa3-58ce-4f0a-99d3-628234ffb2fc). Key: trusted.glusterfs.f92a558a-6b55-4908-8542-990f017593e6.xtime

I just noticed that Saurabh had errors at 'order 1' (8kB) and mine are at 'order 4' (64kB). I don't know if it is the same issue. R.

Comment 15

(In reply to comment #13)
> Sorry, I didn't comment your 'xfs' suggestion (did I mention xfs somewhere?)
>
> My FS is EXT4.

The current gluster storage product runs on XFS. Hence when "glusterfsd" was seen in the stack traces, Eric assumed you are using XFS.

> Dave: I'm trying to give you pure facts. 6th day after remount without
> failures. Before remount - failures every day, almost every 5 minutes
> (during morning import jobs).

I'm not disputing that it made the warnings go away, just that I didn't see the connection. The trace in comment #14 points it out - the ACL code is doing high order (order 5) memory allocations, and that is exhausting the machine of contiguous pages, leading to the ethernet driver failing contiguous allocations. That's a memory management problem, not a filesystem or ethernet driver problem... Cheers, Dave.

Comment 16

We will keep it open and see if we are able to reproduce the issue with RHS2.0 (updates) or RHS2.1 testing... If it is not found by the GA date of RHS2.1, we will close the bug.

Comment 17

Haven't seen any issues in this regard; should I ask QE to see if it happens again?

Comment 18

We had some fixes in getxattr()'s memory allocation part; hopefully they fixed the issues:

http://review.gluster.com/3640
http://review.gluster.com/3673
http://review.gluster.com/3681

Saurabh, please re-open if seen again.
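For future readers hitting the same message: the "order:N" in a page allocation failure means the kernel needed 2^N physically contiguous pages, and a buddyinfo-style line shows how many free chunks of each size exist per zone. A small sketch of both calculations, reusing the DMA32 line quoted in the fragmentation analysis above; the awk parsing is illustrative only, not part of any gluster tooling, and a 4kB page size is assumed (the x86_64 default):

```shell
# "order:N" failure = the kernel needed 2^N contiguous pages (4kB pages assumed)
for order in 0 1 4 5; do
  echo "order:$order needs $(( (1 << order) * 4 ))kB contiguous"
done

# Buddyinfo-style line quoted in the report; on a live box, read /proc/buddyinfo instead
line="Node 0 DMA32: 15869*4kB 3667*8kB 614*16kB 40*32kB 21*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB"

# Count free chunks large enough to satisfy an order:4 (64kB) request.
# Plenty of free memory in 4kB chunks cannot satisfy it - that is fragmentation.
echo "$line" | awk '{
  n = 0
  for (i = 4; i <= NF; i++) {      # fields 4.. are "count*sizekB" entries
    split($i, a, "*")
    sub(/kB/, "", a[2])
    if (a[2] + 0 >= 64) n += a[1]
  }
  print n " free chunks of 64kB or larger"
}'
# prints: 23 free chunks of 64kB or larger
```

This matches the analysis in the thread: the zone holds over 100MB free, but only a couple of dozen chunks can serve an order-4 request, and the order-1 GFP_ATOMIC network allocations compete for the same contiguous ranges.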