Bug 1066702

Summary:	Hugepage allocations hang on numa nodes with insufficient memory
Product:	Red Hat Enterprise Linux 6	Reporter:	Sterling Alexander <stalexan>
Component:	kernel	Assignee:	Rafael Aquini <aquini>
Status:	CLOSED ERRATA	QA Contact:	Li Wang <liwan>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	6.5	CC:	aquini, kernel-mgr, liwan, loberman, lwoodman, pdwyer, pholasek, stalexan, yanwang
Target Milestone:	rc
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:	kernel-2.6.32-542.el6	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2015-07-22 08:04:23 UTC	Type:	Bug
Bug Blocks:	1075802, 1159933

Description Sterling Alexander 2014-02-18 22:59:55 UTC

Description of problem:

On a system with ~80GB on each of nodes 0 and 1 and ~48GB on each of nodes 2 and 3:

[root@dl580g7 ~]# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
node 0 size: 81909 MB
node 0 free: 79969 MB
node 1 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
node 1 size: 81920 MB
node 1 free: 80122 MB
node 2 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
node 2 size: 49152 MB
node 2 free: 48067 MB
node 3 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
node 3 size: 49151 MB
node 3 free: 48040 MB

This issue appears when trying to allocate 200GB of the 256GB to hugepages on the 431.5.1 kernel, but only if we restrict the allocation to two nodes that together do not have enough memory.
 
i.e. via `numactl -m 0-1 echo 102400 > /proc/sys/vm/nr_hugepages_mempolicy`

This tries to allocate 200GB of hugepages on two nodes that together have only 160GB of memory.  For the 6.5 (431.5.1) kernel, the allocation hangs.  For the 6.2 (220.23.1) kernel, as much memory as is available is allocated with no hang; see the following:

[Figure 1 - Allocation results from 220.23.1 kernel]

[root@dl580g7 ~]# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
node 0 size: 81909 MB
node 0 free: 397 MB                  ---->>> All used up here
node 1 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
node 1 size: 81920 MB
node 1 free: 99 MB                   ----->>> All used up here
node 2 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
node 2 size: 49152 MB
node 2 free: 47280 MB
node 3 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
node 3 size: 49151 MB
node 3 free: 47310 MB

For the 6.5 (431.5.1) kernel, the allocation hangs.  For the 6.2 (220.23.1) kernel, as much memory as is available is allocated with no hang (~160GB when we requested 200GB).
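
The size of the mismatch can be checked with simple arithmetic. The sketch below is only an illustration: it assumes 2 MB huge pages (the x86_64 default; the report does not state the page size) and uses the node 0 and node 1 sizes copied from the `numactl --hardware` output above.

```shell
# Assumption: 2 MB huge pages (check Hugepagesize in /proc/meminfo).
hugepage_mb=2
requested_pages=102400
# Sizes in MB of nodes 0 and 1, taken from the numactl --hardware output.
node_sizes_mb="81909 81920"

requested_mb=$((requested_pages * hugepage_mb))
available_mb=0
for size in $node_sizes_mb; do
    available_mb=$((available_mb + size))
done

echo "requested: ${requested_mb} MB"
echo "nodes 0-1 total: ${available_mb} MB"
echo "shortfall: $((requested_mb - available_mb)) MB"
```

Under that assumption the request is 204800 MB (200GB) against 163829 MB (~160GB) of total memory on the two nodes, so roughly 40GB cannot be satisfied even before kernel and other usage is taken into account.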

Version-Release number of selected component (if applicable):  kernel-2.6.32-431.5.1.el6


How reproducible:  Every time


Steps to Reproduce:
1. Install 6.2 kernel
2. Check numa zones with `numactl --hardware`
3. Attempt to over-allocate huge pages to a specific zone/zones (`numactl -m 0-1 echo <more hugepages than memory in numa zones> > /proc/sys/vm/nr_hugepages_mempolicy`). This will 'succeed' in allocating as much memory as possible (see Figure 1 above)
4. Install 6.5 kernel
5. Re-do huge page allocation with numactl, this will hang
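
Before step 3 or step 5, it can help to know how many huge pages could fit on the chosen nodes at all. The following is a minimal sketch, not part of the original report. It assumes 2 MB huge pages and reads the per-node `meminfo` files under `/sys/devices/system/node/`. The helper names `node_free_mb` and `max_hugepages` are hypothetical.

```shell
# Assumption: 2 MB huge pages; adjust if /proc/meminfo reports another Hugepagesize.
HUGEPAGE_MB=2

# Print MemFree in MB from one per-node meminfo file, whose lines look like
# "Node 0 MemFree:  81888256 kB".
node_free_mb() {
    awk '$3 == "MemFree:" { print int($4 / 1024) }' "$1"
}

# Print an upper bound on the huge pages that fit into the free memory
# described by the given per-node meminfo files.
max_hugepages() {
    total_mb=0
    for meminfo in "$@"; do
        free_mb=$(node_free_mb "$meminfo")
        total_mb=$((total_mb + ${free_mb:-0}))
    done
    echo $((total_mb / HUGEPAGE_MB))
}

# Hypothetical usage for nodes 0 and 1:
# max_hugepages /sys/devices/system/node/node0/meminfo \
#               /sys/devices/system/node/node1/meminfo
```

The result is only an upper bound: free memory is not necessarily available as contiguous 2 MB blocks. Any request above this number is guaranteed to be an over-allocation of the kind described in the steps above.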

Actual results:  Hang on allocation


Expected results:  Hard to say whether the current behavior or the previous was more correct.  


Additional info:  This is a change from previous behavior.  I would like some guidance as to whether the behavior from 6.2 (best effort allocation) was more correct or the current 6.5 behavior (hang on over-ambitious allocation) is the preferred result of a 'bad' allocation request.

Comment 2 Petr Holasek 2014-02-20 13:40:55 UTC

Hi Sterling,

thank you for your report, it really does seem like buggy behaviour on the 6.5 side. How exactly does it hang? Could you please provide the dmesg output after the hang? I've tried to reserve a similar HP DL580 G7 in Beaker, but without success so far. Would it be possible to provide access to your machine?

thanks,
Petr

Comment 3 loberman 2014-02-20 13:59:36 UTC

Hello Petr, I helped the customer capture a dump on boot after hang.
We will get the dump.

Notes:

We captured a crashdump from the boot hang.  In it I see the only active task that is not swapper is sysctl:

crash> files 37729
PID: 37729  TASK: ffff89bfa52a6040  CPU: 50  COMMAND: "sysctl"
ROOT: /    CWD: /
 FD       FILE            DENTRY           INODE       TYPE PATH
  0 ffff884059093540 ffff89bfa7ce7f00 ffff8a3fa6a53108 CHR  /dev/console
  1 ffff8a3fa2bd0b00 ffff88bfa7cbd180 ffff893fa7b0d108 CHR  /dev/null
  2 ffff8a3fa2bd0b00 ffff88bfa7cbd180 ffff893fa7b0d108 CHR  /dev/null
  3 ffff8aafb601c8c0 ffff8aafb34b4780 ffff8aafb7814ca8 REG  /etc/sysctl.conf
  4 ffff8aafb3a48080 ffff8aafb34b4900 ffff8aafb37d7ab8 REG  /proc/sys/vm/nr_hugepages

crash> bt 37729
PID: 37729  TASK: ffff89bfa52a6040  CPU: 50  COMMAND: "sysctl"
 ...
     RIP: ffffffff8152a357  RSP: ffff89bfa0d8da08  RFLAGS: 00000246
    RAX: 0000000000000000  RBX: ffff89bfa0d8da08  RCX: ffffea0c7e632090
    RDX: ffff8b8000019380  RSI: 0000000000000246  RDI: 0000000000000246
    RBP: ffffffff8100bb8e   R8: 0000000000000001   R9: ffff89bfa0d8dae8
    R10: ffff8b8000445e80  R11: 0000000000000000  R12: ffff8b8000444e00
    R13: 0000000000000000  R14: ffff89bfa0d8da08  R15: ffffffff8100bb8e
    ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018
#16 [ffff89bfa0d8da10] compact_zone at ffffffff8116a06b
#17 [ffff89bfa0d8dad0] compact_zone_order at ffffffff8116a81c
#18 [ffff89bfa0d8db80] try_to_compact_pages at ffffffff8116a951
#19 [ffff89bfa0d8dbf0] __alloc_pages_direct_compact at ffffffff8112f1ba
#20 [ffff89bfa0d8dc60] __alloc_pages_nodemask at ffffffff8112f69f
#21 [ffff89bfa0d8dda0] alloc_fresh_huge_page at ffffffff81160ede
#22 [ffff89bfa0d8ddd0] set_max_huge_pages at ffffffff81161714
#23 [ffff89bfa0d8de20] hugetlb_sysctl_handler_common at ffffffff81163873
#24 [ffff89bfa0d8de70] hugetlb_sysctl_handler at ffffffff811638ee
#25 [ffff89bfa0d8de80] proc_sys_call_handler at ffffffff811fd6f7
#26 [ffff89bfa0d8dee0] proc_sys_write at ffffffff811fd744
#27 [ffff89bfa0d8def0] vfs_write at ffffffff81188f78
#28 [ffff89bfa0d8df30] sys_write at ffffffff81189871
#29 [ffff89bfa0d8df80] system_call_fastpath at
...

The zonelist at 0xffff8b800002ade0 was passed to __alloc_pages_direct_compact.  That zonelist is node_zonelists[1] of:

crash> pg_data_t.node_id 0xffff8b80000001c0
  node_id = 0x7

So it seems likely the reproduction using numactl coupled with nr_hugepages_mempolicy replicates what is happening during boot on this customer's system.

Let me know if you require the vmcore obtained from the boot hang 2014-02-18.

=- Curt

Comment 4 Petr Holasek 2014-02-26 11:27:15 UTC

Hi Curt,

thanks a lot for the backtrace. I've just borrowed a similar G7 machine, but if you have a chance, please upload the vmcore somewhere.

thanks,
Petr

Comment 5 loberman 2014-02-26 13:44:27 UTC

Petr,

I can likely reproduce and get a forced crash. I was able to reproduce the issue of the hang once booted using numactl -m 0-1 echo 102400 > /proc/sys/vm/nr_hugepages_mempolicy.

I was never able to reproduce the hard hang on boot when simply setting a value for hugepages in /etc/sysctl.conf, as long as that value fit into the memory across all numa nodes. I only have 256GB though, and that takes around 5 to 6 minutes to allocate, so with 1.2TB the allocations on boot would take some time to complete.

In the numactl -m 0-1 echo 102400 > /proc/sys/vm/nr_hugepages_mempolicy test,
the prior kernel version (6.2) returns without hanging, but it does not produce a warning that it could not allocate the memory asked for.

The newer kernel (6.5) indeed does hang.

Let me know if you want a forced crash after bootup using numactl -m 0-1 echo 102400 > /proc/sys/vm/nr_hugepages_mempolicy.

Here are the notes from my testing
------------------------------------
Testing now on the 431 stock 6.5 kernel and attempting to allocate 230GB so I will cross all 4 numa nodes.

The numactl -m 0-1 echo 102400 > /proc/sys/vm/nr_hugepages_mempolicy hangs in my lab on 6.5 (431.5.1).  It does not hang on 6.2.

Comment 6 Petr Holasek 2014-03-10 14:19:37 UTC

Hello,

(In reply to loberman from comment #5)
> Petr,
> 
> I can likely reproduce and get a forced crash. I was able to reproduce the
> issue of the hang once booted using numactl -m 0-1 echo 102400 >
> /proc/sys/vm/nr_hugepages_mempolicy.
> 
> I was never able to reproduce the hard hang on boot when simply setting a
> value for hugepages in /etc/sysctl.conf as long as that fitted into the
> memory across all numa nodes. I only have 256GB though, and that takes
> around 5 to 6  minutes to allocate so with 1.2TB that would take some time
> to complete allocations on boot.
> 
> In the numactl -m 0-1 echo 102400 > /proc/sys/vm/nr_hugepages_mempolicy test
> the prior kernel version (6.2)  exits out and does not hang but does not
> produce a warning that it could not allocate the memory asked for.
> 
> The newer kernel (6.5) indeed does hang.
> 
> Let me know if you want a forced crash after bootup using numactl -m 0-1
> echo 102400 > /proc/sys/vm/nr_hugepages_mempolicy.

sorry for the late reply. Yes please, I'd love to look into the vmcore. Thank you!

> 
> Here are the notes from my testing
> ------------------------------------
> Testing now on the 431 stock 6.5 kernel and attempting to allocate 230GB so
> I will cross all 4 numa nodes.
> 
> The numactl -m 0-1 echo 102400 > /proc/sys/vm/nr_hugepages_mempolicy hangs
> in my lab on 6.5 (431.5.1).  It does not hang on 6.2.

Comment 7 Petr Holasek 2014-04-18 12:53:28 UTC

Hi,

I've completed testing of a few kernels and can confirm that the issue is not related to libhugetlbfs. I didn't see the issue with kernel -279 (rhel-6.3), but it appeared on the same system with kernel -358 (rhel-6.4). I was even able to reproduce it on a recent vanilla 3.15.0-rc1.

Testing note: I've seen the issue only on machines with > ~100G of memory spread across 2+ nodes.

So I am reassigning this to the kernel component.

Comment 9 loberman 2014-05-15 15:08:58 UTC

6.5 2.6.32-431.11.2.el6.x86_64
Hangs here:
numactl -m 0-1 echo 102400 > /proc/sys/vm/nr_hugepages_mempolicy

Where are we blocked:

 34.55%  [kernel]                      [k] compact_zone
 19.99%  [kernel]                      [k] get_pageblock_flags_group
 13.44%  [kernel]                      [k] _spin_lock_irqsave
 13.11%  [kernel]                      [k] native_write_msr_safe
  8.21%  [kernel]                      [k] compact_checklock_irqsave
  0.81%  [kernel]                      [k] _spin_unlock_irqrestore
  0.44%  [kernel]                      [k] __reset_isolation_suitable
  0.32%  [kernel]                      [k] tick_nohz_stop_sched_tick

--------------------------------------------------------------------------
6.3  kernel-2.6.32-279.el6.x86_64
This command returns to the # prompt:
numactl -m 0-1 echo 102400 > /proc/sys/vm/nr_hugepages_mempolicy

# sysctl -a | grep huge
vm.nr_hugepages = 95993
vm.nr_hugepages_mempolicy = 95993
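
Because the older kernels return without any warning, the only way to notice a partial allocation is to compare the requested count with what the kernel reports afterwards. A minimal sketch of that comparison follows. The helper name `check_hugepages` is hypothetical, and the numbers are the request of 102400 and the 95993 reported by the 6.3 kernel above.

```shell
# Report whether a huge page request was fully satisfied.
# $1 = pages requested, $2 = pages the kernel reports afterwards.
check_hugepages() {
    if [ "$2" -lt "$1" ]; then
        echo "short by $(( $1 - $2 )) pages"
    else
        echo "request satisfied"
    fi
}

# On a live system the second value could be read with:
#   cat /proc/sys/vm/nr_hugepages
check_hugepages 102400 95993
```

With these values the request falls 6407 pages short, which the 6.3 kernel accepted silently.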

--------------------------------------------------------------------------
6.4  kernel-2.6.32-358.el6.x86_64

# sysctl -a | grep huge
vm.nr_hugepages = 32768
vm.nr_hugepages_mempolicy = 32768
         
numactl -m 0-1 echo 102400 > /proc/sys/vm/nr_hugepages_mempolicy
         
Hangs here
   
 59.35%  [kernel]                      [k] _spin_lock_irqsave
 10.97%  [kernel]                      [k] native_write_msr_safe
  8.79%  [kernel]                      [k] compact_zone
  4.40%  [kernel]                      [k] get_pageblock_flags_group
  3.98%  [kernel]                      [k] smp_call_function_many
  1.97%  [kernel]                      [k] compact_checklock_irqsave
  0.83%  [kernel]                      [k] tick_nohz_stop_sched_tick

Forced crash and will find a place to provide vmcore after analyzing.

Thanks
Loberman

Comment 10 loberman 2014-05-15 20:23:23 UTC

With the 6.4 kernel I do see the mempolicy hugepages allocated but we never return from the numactl command.

I forced another crash and can make the vmcore available.

Thanks
loberman

Comment 11 loberman 2014-05-29 20:10:29 UTC

Hello

Please can I have an update, I will also go find Larry Woodman.

Comment 12 loberman 2014-06-03 15:36:10 UTC

Hi Larry,

Any updates to share with the customer yet?

Thanks again for helping here.

loberman

Comment 30 Rafael Aquini 2015-03-07 05:37:49 UTC

Patch(es) available on kernel-2.6.32-542.el6

Comment 42 errata-xmlrpc 2015-07-22 08:04:23 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1272.html