Bug 251043

Summary: [RHEL5 U1] [ia64] Kernel test failing under limited memory
Product: Red Hat Enterprise Linux 5
Reporter: Jeff Burke <jburke>
Component: kernel
Assignee: Aron Griffis <agriffis>
Status: CLOSED ERRATA
QA Contact: Martin Jenner <mjenner>
Severity: medium
Priority: urgent
Version: 5.1
CC: dchapman, ddomingo, dwalsh, dzickus, lwang, mgahagan, prarit, sgrubb, tyamamot, yuchen
Target Milestone: rc
Keywords: ZStream
Hardware: ia64
OS: Linux
URL: http://rhts.lab.boston.redhat.com/cgi-bin/rhts/test_list.cgi?test_filter=/kernel/misc/ktst_msg&package_filter=kernel-xen&package_arch=ia64&package_version=2.6.18&package_release=38.el5&type=KernelTier1&type=KernelTier2&result=Fail&all_packages=0
Whiteboard: GSSApproved
Fixed In Version: RHBA-2008-0314
Doc Type: Bug Fix
Last Closed: 2008-05-21 14:48:45 UTC
Bug Blocks: 222082, 412091, 425461

Description Jeff Burke 2007-08-06 18:16:26 UTC
Description of problem:
 While running the KernelTier1 test /kernel/misc/ktst_msg, we receive failures.


Version-Release number of selected component (if applicable):
2.6.18-38.el5xen

How reproducible:
Always

Steps to Reproduce:
1. Install the RHEL5.1 tree RHEL5.1-Server-20070725.0 and install/create a xen guest.
2. Run the following RHTS test: rh-tests-kernel-misc-ktst_msg

Actual results:
 http://rhts.lab.boston.redhat.com/cgi-bin/rhts/test_log.cgi?id=433557
 http://rhts.lab.boston.redhat.com/cgi-bin/rhts/test_log.cgi?id=433558
 http://rhts.lab.boston.redhat.com/cgi-bin/rhts/test_log.cgi?id=433560
 http://rhts.lab.boston.redhat.com/cgi-bin/rhts/test_log.cgi?id=433562
 http://rhts.lab.boston.redhat.com/cgi-bin/rhts/test_log.cgi?id=433564

Expected results:
 Pass

Additional info:
 This test _ONLY_ fails on a xen guest. It works as expected on a Dom0.

Comment 2 Keiichiro Tokunaga 2007-08-24 17:13:54 UTC
I have done the investigation.

[Summary]
This is not an issue specific to ia64 xen guests; it is generic to ia64.  I think
there are three possible ways to resolve this issue:

  1) Increase the memory size
  2) Create a swap device larger than 2GB
  3) Run 'ulimit -Hs unlimited' before running the program

[Details]
The reason it occurred only on the ia64 xen guest is that the guest met all three
of the following conditions:
  1) the memory size for the guest was 512MB (per my quick test, the failure
occurs with about 700MB or less)
  2) the swap size for the guest was less than 2GB
  3) the default "Hard limit" for processes on the guest was 2097152k (not
'unlimited')
  (If all three conditions are met, the failures occur on other arches as well.)

The first two apply to all arches, but the third one is ia64 specific (it applies
to bare-metal Linux, dom0, and guests).  The default value of the "Hard limit" on
process stack size is 2097152k on ia64 for some reason, whereas it is 'unlimited'
on other arches.  So, in that sense, this can be considered a generic ia64 issue.

When the 'make' command is invoked, it raises its "Soft limit" to the same value
as the "Hard limit" (that may just be how make behaves), and that causes the
failures: pthread_create() tries to allocate (mmap) a thread stack the same size
as the "Soft limit", and there is not enough memory for it.

However, the allocation seems to complete successfully if there is more than 2GB
of swap, which is larger than the "Hard limit".  I do not have an explanation
for this swap behavior yet.
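
For illustration only (not part of the original report), here is a minimal C
sketch of the interaction described above, assuming glibc's NPTL behavior of
sizing the default thread stack from the RLIMIT_STACK soft limit:

/* Sketch: raise the stack soft limit to the hard limit (as make reportedly
 * does), then create a thread.  With ia64's default 2097152k hard limit,
 * NPTL will try to mmap a ~2GB thread stack, which fails on a 512MB guest
 * with less than 2GB of swap.  Assumes NPTL uses the RLIMIT_STACK soft
 * limit as the default thread stack size. */
#include <pthread.h>
#include <stdio.h>
#include <sys/resource.h>

static void *worker(void *arg)
{
    (void)arg;
    return NULL;
}

int main(void)
{
    struct rlimit rl;
    pthread_t tid;

    getrlimit(RLIMIT_STACK, &rl);
    printf("RLIMIT_STACK soft=%llu hard=%llu\n",
           (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);

    /* Mimic what make reportedly does: soft limit := hard limit. */
    rl.rlim_cur = rl.rlim_max;
    if (setrlimit(RLIMIT_STACK, &rl) != 0)
        perror("setrlimit");

    /* On the affected guest this allocation is expected to fail (EAGAIN). */
    int err = pthread_create(&tid, NULL, worker, NULL);
    if (err != 0)
        fprintf(stderr, "pthread_create failed: %d\n", err);
    else
        pthread_join(tid, NULL);
    return 0;
}

Running 'ulimit -Hs unlimited' first (workaround 3 above) makes NPTL fall back
to its architecture default stack size instead of the 2GB limit.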

Comment 5 Aron Griffis 2007-09-14 04:35:54 UTC
Thanks to Kei for the comprehensive investigation.  Looking at kernel.org, it
appears that the default stack hard limit was removed earlier this year:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=d826393cdebe340b3716002bfb1298ab19b57e83

This patch brings ia64 into alignment with x86 and x86_64, neither of which has
a hard stack limit by default.  Also, Debian stable runs 2.6.18 with this patch,
so it appears to work as expected on the 2.6.18 kernel (we're using Debian
stable on our lab server at HP).
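
As a quick check (not from the original comment), a small program can report the
inherited stack hard limit; on an unpatched ia64 kernel it is expected to show
2097152 kB, and with the upstream change it should show unlimited:

/* Sketch: print the inherited RLIMIT_STACK hard limit for the current
 * process.  Expected: 2097152 kB on an unpatched ia64 kernel, unlimited
 * (RLIM_INFINITY) once the default hard limit is removed, as on x86/x86_64. */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_STACK, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }

    if (rl.rlim_max == RLIM_INFINITY)
        printf("stack hard limit: unlimited\n");
    else
        printf("stack hard limit: %llu kB\n",
               (unsigned long long)(rl.rlim_max / 1024));
    return 0;
}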

I'm building kernels tonight with this modification and will make them available
for testing in the morning.

Comment 6 Aron Griffis 2007-09-14 05:57:42 UTC
Test kernels available at http://free.linux.hp.com/~agriffis/rhel5/bz251043/

I posted the patch for review on rhkernel-list.  In that message I requested
help getting it submitted to RHTS, since I haven't done that before.  When
that's done, I'll change the status to POST.

Comment 7 Aron Griffis 2007-09-14 17:21:13 UTC
With help from Kei, I was able to run the RHTS test on dom0 on my
rx6600.

2.6.18-45.el5xen (unpatched)                    -- FAIL
2.6.18-47.el5.bz251043.agriffis.1xen (patched)  -- PASS


Comment 8 RHEL Program Management 2007-11-13 20:56:56 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 13 Don Zickus 2007-11-29 17:05:29 UTC
in 2.6.18-58.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 19 errata-xmlrpc 2008-05-21 14:48:45 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0314.html


Comment 20 chen yuwen 2010-11-02 03:41:11 UTC
fv_* tests all failed on 2.6.18-227.el5xen.

OS: RHEL5.6-20101014.0
Kernel: 2.6.18-227.el5xen
V7: 1.2-25.el5

#virsh console v7ia64

Output for fv_core testing:
----------------------------
...
+----------ia64 CPU info end----------+ 

Checking clock jitter ...
Single CPU detected. No clock jitter testing necessary.
clock direction test: start time 1288605280, stop time 1288605340, sleeptime 60, delta 0
PASSED
audispd invoked oom-killer: gfp_mask=0x200d2, order=0, oomkilladj=0

Call Trace:
 [<a000000100013ba0>] show_stack+0x40/0xa0
                                sp=e00000000206f5c0 bsp=e000000002069468
 [<a000000100013c30>] dump_stack+0x30/0x60
                                sp=e00000000206f790 bsp=e000000002069450
 [<a000000100113a50>] out_of_memory+0xf0/0x780
                                sp=e00000000206f790 bsp=e000000002069418
 [<a00000010011a2c0>] __alloc_pages+0x420/0x540
                                sp=e00000000206f820 bsp=e0000000020693a0
 [<a000000100152290>] alloc_page_vma+0x150/0x180
                                sp=e00000000206f830 bsp=e000000002069368
 [<a000000100145f10>] read_swap_cache_async+0x70/0x220
                                sp=e00000000206f830 bsp=e000000002069320
 [<a00000010012cfa0>] swapin_readahead+0xa0/0x240
                                sp=e00000000206f830 bsp=e0000000020692d8
 [<a0000001001331a0>] __handle_mm_fault+0x1400/0x1d00
                                sp=e00000000206f840 bsp=e000000002069260
 [<a000000100652e20>] ia64_do_page_fault+0x240/0xa40
                                sp=e00000000206f850 bsp=e000000002069210
 [<a00000010000c040>] __ia64_leave_kernel+0x0/0x280
                                sp=e00000000206f900 bsp=e000000002069210
 [<a0000001001a1e30>] do_sys_poll+0x590/0x740
                                sp=e00000000206fad0 bsp=e000000002069180
 [<a0000001001a2680>] sys_poll+0x80/0xc0
                                sp=e00000000206fe20 bsp=e000000002069128
 [<a00000010000bea0>] ia64_ret_from_syscall+0x0/0x40
                                sp=e00000000206fe30 bsp=e000000002069128
 [<a000000000010620>] __start_ivt_text+0xffffffff00010620/0x400
                                sp=e000000002070000 bsp=e000000002069128
...
Swap cache: add 45947, delete 45947, find 158/264, race 0+0
Free swap  = 0kB
Total swap = 720864kB
Out of memory: Killed process 2278 (stress).
stress: page allocation failure. order:0, mode:0x280d2
...


Output for fv_memory testing:
----------------------------------
...
Starting Threaded Memory Test
running for more than free memory at 195 MB for 60 sec.
automount invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0

Call Trace:
 [<a000000100013ba0>] show_stack+0x40/0xa0
                                sp=e00000000331fa70 bsp=e000000003319500
 [<a000000100013c30>] dump_stack+0x30/0x60
                                sp=e00000000331fc40 bsp=e0000000033194e8
 [<a000000100113a50>] out_of_memory+0xf0/0x780
                                sp=e00000000331fc40 bsp=e0000000033194b0
 [<a00000010011a2c0>] __alloc_pages+0x420/0x540
                                sp=e00000000331fcd0 bsp=e000000003319440
 [<a000000100152110>] alloc_pages_current+0x170/0x1a0
                                sp=e00000000331fce0 bsp=e000000003319410
 [<a00000010010b7c0>] page_cache_alloc_cold+0x1a0/0x1c0
                                sp=e00000000331fce0 bsp=e0000000033193e8
 [<a00000010011dec0>] __do_page_cache_readahead+0x120/0x460
                                sp=e00000000331fce0 bsp=e000000003319388
 [<a00000010011eb00>] do_page_cache_readahead+0xe0/0x120
                                sp=e00000000331fd70 bsp=e000000003319350
 [<a000000100111820>] filemap_nopage+0x280/0x7c0
                                sp=e00000000331fd70 bsp=e0000000033192e8
 [<a000000100132170>] __handle_mm_fault+0x3d0/0x1d00
                                sp=e00000000331fd70 bsp=e000000003319270
 [<a000000100652e20>] ia64_do_page_fault+0x240/0xa40
                                sp=e00000000331fd80 bsp=e000000003319220
 [<a00000010000c040>] __ia64_leave_kernel+0x0/0x280
                                sp=e00000000331fe30 bsp=e000000003319220
...
Swap cache: add 0, delete 0, find 0/0, race 0+0
Free swap  = 0kB
Total swap = 0kB
Out of memory: Killed process 2270 (threaded_memtes).
done.
...