Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
For bugs related to the Red Hat Enterprise Linux 5 product line. The current stable release is 5.10. For Red Hat Enterprise Linux 6 and above, please visit Red Hat JIRA (https://issues.redhat.com/secure/CreateIssue!default.jspa?pid=12332745) to report new issues.

Bug 544448

Summary: Strange vm performance degradation moving 32 bit app from RHEL 4.6 32bit to 5.4 64bit
Product: Red Hat Enterprise Linux 5
Reporter: Casey Dahlin <cdahlin>
Component: kernel
Assignee: Larry Woodman <lwoodman>
Status: CLOSED ERRATA
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 5.4
CC: aarcange, anderson, cevich, cward, fhirtz, herrold, jwest, lwoodman, mbelangia, plyons, riel, tao, vanhoof
Target Milestone: rc
Keywords: ZStream
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 790921 (view as bug list)
Environment:
Last Closed: 2010-03-30 06:55:38 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 562746, 790921, 811628
Attachments:
- Comparison of two environments Anon pages and mmap()'d files per second (flags: none)
- Graph showing total # lines in /proc/pid/maps summed across 8 processes (flags: none)
- Patch sent to rhkernel-list to address this issue via /proc/sys/vm tunable (flags: none)
- modified reproducer for easier inclusion in automated testing (flags: cevich: review+)

Description Casey Dahlin 2009-12-04 22:11:15 UTC
Description of problem:
A customer runs an app that frequently maps and unmaps large numbers of files from an NFS share. The files all fit within one page. There are 98 million of them in total (only a relatively small chunk of which are mapped at any given time).

They moved this app from a 32bit box running RHEL 4.6 to a 64bit box running RHEL 5 (the app remains compiled for 32bit) and began to see gradually degrading performance as the app runs, with large amounts of CPU time being spent in the kernel. Oprofile reveals the following:

samples  %        image name               app name                 symbol name
123841042 77.4341  vmlinux                  vmlinux                  find_vma
2600739   1.6262  vmlinux                  vmlinux                  arch_get_unmapped_area
1985322   1.2414  vmlinux                  vmlinux                  arch_get_unmapped_area_topdown
1562323   0.9769  sunrpc                   sunrpc                   /sunrpc
1454803   0.9096  nfs                      nfs                      /nfs

A systemtap module designed to print backtraces for random find_vma calls shows that most of the calls are made as follows:

0xffffffff8000bc6c : find_vma+0x1/0x57 [kernel]
0xffffffff80066aab : do_page_fault+0x451/0x830 [kernel]
0xffffffff8005dde9 : add_disk_randomness+0x1052/0x1115 [kernel]

(add_disk_randomness is assumed to be a bogus symbol).
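
The access pattern described above can be sketched in a few lines (this is an illustrative stand-in, not the customer's actual code; anonymous mappings stand in for the one-page NFS-backed files):

```python
import mmap
import time

PAGE = 4096  # the files all fit within one page

def churn(iterations=1000, live=64):
    """Repeatedly mmap() one-page regions, keep a rotating working set
    mapped, and time each mmap() call.  As the address space fills with
    many small VMAs, each new mapping costs a longer find_vma() walk in
    the kernel -- the degradation reported in this bug."""
    live_maps = []
    timings_ms = []
    for _ in range(iterations):
        t0 = time.perf_counter()
        m = mmap.mmap(-1, PAGE)          # one anonymous page
        timings_ms.append((time.perf_counter() - t0) * 1000.0)
        live_maps.append(m)
        if len(live_maps) > live:        # unmap the oldest mapping
            live_maps.pop(0).close()
    for m in live_maps:
        m.close()
    return timings_ms

timings = churn()
```

On an affected kernel, the per-call timings from a loop like this trend upward as the VMA count grows.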

Version-Release number of selected component (if applicable):
Original RHEL 4 kernel: 2.6.9-67.0.22
RHEL 5 kernel: 2.6.18-164

How reproducible:


Steps to Reproduce:
Customer runs data through their app for several hours.
  
Actual results:
Performance gradually degrades

Expected results:
Performance remains consistent with RHEL 4

Comment 1 Larry Woodman 2009-12-07 16:55:02 UTC
Can someone include a "cat /proc/<pid>/maps" on both systems?  Perhaps there are a LOT more mapped files on the 64-bit RHEL5 system than the 32-bit RHEL4 system???

Also, can you include an AltSysrq-M output for both machines as well so we can see if this is some NUMA issue?

Thanks, Larry

Comment 3 Larry Woodman 2009-12-07 19:41:24 UTC
There is no difference in the find_vma() routines between RHEL4 & RHEL5 or 32-bit & 64-bit.  The only potential difference is that a 64-bit system can map MANY more one-page files than a 32-bit system, and the amount of time find_vma() takes corresponds directly with the number of mappings or vmas.  The most likely cause of this *problem* is they have many more mapped files on the 64-bit system than they ever did on the 32-bit system.

A simple "cat /proc/<pid>/maps | wc" will give us the answer.

Larry

Comment 5 Issue Tracker 2009-12-10 15:31:23 UTC
Event posted on 12-10-2009 10:31am EST by cevich

data from pmap_sum (server01 only).

The two setups are basically on par, though 5.4 seemed to take longer
(I haven't verified by timestamps yet).  I'm sure engineering will find
this helpful/interesting.


This event sent from IssueTracker by cevich 
 issue 361893
it_file 280797

Comment 6 Chris Evich 2009-12-10 15:33:11 UTC
Created attachment 377481 [details]
Comparison of two environments Anon pages and mmap()'d files per second

Comment 7 Chris Evich 2009-12-10 15:37:27 UTC
Note: It "looks" like it, but I'm double-checking that these are the average number of ANON & MMAP maps per 15-second interval.  The tests in each environment are identical -- basically they're "playing back" the same, real, transactional data recorded earlier.

Comment 8 Larry Woodman 2009-12-10 16:06:34 UTC
What am I missing here?  This graph shows very similar times for anon and mapped files for both 4.6 & 5.3.  

Larry

Comment 9 Rik van Riel 2009-12-10 16:09:02 UTC
For one, we don't know the number of unmaps.  If the 64 bit system keeps more files mapped than the 32 bit system, that would explain the slowdown.

Of course, the data provided does not actually contain the info we need :)

Comment 10 Rik van Riel 2009-12-10 16:35:17 UTC
Never mind me.  The graph contains the number of VMAs (file and anon), sampled every second, which is the data we wanted.

It also shows that the slowdown in 5.4 is pretty small compared to 4.6, so chances are 4.6 has similar CPU use in find_vma.

Comment 11 Larry Woodman 2009-12-10 16:37:46 UTC
If there were lots more vmas on the 5.4/64-bit system you'd think the slope
would be flatter than on the 4.6/32-bit system.

Larry

Comment 12 Chris Evich 2009-12-10 17:23:47 UTC
Sorry for the confusion.  I'm 80% sure this data comes from sampling /proc/pid/maps several times, sorting based on mmap's and anon maps, totaling them, and averaging the # over 15s intervals.  I'm double-checking this w/ the customer.   Assuming that's what it is, it appears as if there's virtually no difference over time between 4u6 and 5.4 test runs for one of the 8 processes on each.  Eye-balling the data, the other 7 processes' data looked similar, so I didn't bother graphing.  I can upload raw data though if it will help...

Comment 13 Chris Evich 2009-12-10 17:25:02 UTC
We have oprofile data from both 4u6 and 5.4, I'll double-check but I believe they do show find_vma as the most frequent

Comment 14 Chris Evich 2009-12-10 17:34:44 UTC
Re: slowdown 4u6 (32bit) vs 5.4 (32bit code 64bit kernel)

Yes, after switching on legacy_va_layout the overall performance of the test/app is the same.  The customer's main concern is that performance in certain areas has gotten worse while others (logically) must have gotten better.  One of the things they've measured as being worse is the performance of mmap() -- a factor of 10x slower in 5.4.

Comment 15 Chris Evich 2009-12-15 21:17:20 UTC
Confirmed w/ customer: I was misled initially, this is just the size, NOT the # of VMAs.  I'll get back w/ the customer to provide the # of VMAs after the holidays.

(In reply to comment #12)
> Sorry for confusion.  I'm 80% sure this data comes from sampling /proc/pid/maps
> several times, sorting based on mmap's and anon maps, totaling them and
> averaging # over 15s interval.

Comment 16 Issue Tracker 2009-12-22 18:55:20 UTC
Event posted on 12-22-2009 01:55pm EST by mbelangia




This event sent from IssueTracker by mbelangia 
 issue 361893

Comment 22 Chris Evich 2010-01-11 19:54:19 UTC
(In reply to comment #3)
> takes corresponds directly with number of mappings or vmas.  The most likey
> cause of this *problem* is they have many more mapped files on the 64-bit
> system than they ever did the 32-bit system.
> 
> A simple "cat /proc/<pid>/maps | wc" will give us the answer.
> 
> Larry    

Note (b/c I haven't seen it mentioned anywhere):

The answer to this question would have come from data provided in (private) comment https://bugzilla.redhat.com/show_bug.cgi?id=544448#c17

That data _should_ be, the contents of /proc/pid/maps captured 1-hour into the test, for 1-hour, at 1-minute intervals.

Comment 23 Issue Tracker 2010-01-11 23:11:05 UTC
Event posted on 01-11-2010 06:11pm EST by fhirtz

<email>
For the next test iteration, we have a couple of things to check since
they appear broken with regard to the data gathering, and a change:

1) A few of the items that were to be gathered came out as empty:
   a) opreport
   b) systemtap output
2) We don't need the maps gathering to be done at 1min intervals. 1 hour
intervals would be fine for them.

We're still ingesting the information that you sent over, and though the
limited systemtap data was a bit different than we'd seen before in what
it was showing, we still believe that we're in the right area in terms of
where we're looking with regard to mapped memory. The reproducer should
give us the ability to test this in a different manner, though we do need
to confirm that it's showing the same thing as the actual test data.

Please let me know if you have questions, observations, or concerns, and
thank you for all of the work here.

Frank.
</email>


This event sent from IssueTracker by fhirtz 
 issue 361893

Comment 24 Chris Evich 2010-01-11 23:30:04 UTC
Created attachment 383120 [details]
Graph showing total # lines in /proc/pid/maps summed across 8 processes

As expected...not much different from RHEL 4u6 to 5.4 in terms of number of VMAs over time.  

The same data captured every hour instead of per-minute will be interesting when test run is done.

Comment 25 Issue Tracker 2010-01-12 03:57:20 UTC
Event posted on 01-11-2010 10:57pm EST by fhirtz

A quick graph of mpstat "all" CPU consumption. %user increases marginally
over the course of the ~25hr run, and %sys increases from ~7% to %17% over
the course of the run. It doesn't appear that system time is starving out
user time exactly since user % isn't suffering, but it could result in
higher latencies one would presume.


This event sent from IssueTracker by fhirtz 
 issue 361893
it_file 291213

Comment 26 Issue Tracker 2010-01-12 20:26:34 UTC
Event posted on 01-12-2010 03:25pm EST by fhirtz

File uploaded: mmap.rep.anon.c

This event sent from IssueTracker by fhirtz 
 issue 361893
it_file 294033

Comment 27 Issue Tracker 2010-01-12 20:26:37 UTC
Event posted on 01-12-2010 03:25pm EST by fhirtz

A test reproducer from Alan. I've requested a bit of information from him
to get an idea how long that they're having to run this and what sort of
results that they've been getting with it so we can make sure that we're
seeing analogous results here.


This event sent from IssueTracker by fhirtz 
 issue 361893

Comment 28 Chris Evich 2010-01-12 21:35:00 UTC
Setting 'CLR_MMAP_CNT     75000' and using line 149 instead of 150 makes it degrade very rapidly (i.e. lots more holes w/ more widely varying sizes).  Time goes up over 100+ after 50 iterations for me, instead of at 250 w/ the vanilla parameters.

Comment 29 Issue Tracker 2010-01-12 21:53:09 UTC
Event posted on 01-12-2010 04:53pm EST by fhirtz

Test program compiled as 32bit, running untuned and with legacy_va_layout
respectively.


This event sent from IssueTracker by fhirtz 
 issue 361893
it_file 295093

Comment 30 Issue Tracker 2010-01-12 21:58:51 UTC
Event posted on 01-12-2010 04:58pm EST by fhirtz

As an interesting test quirk; RHEL4 doesn't like this test at all:

<32bit>
[root@v40z]# ./mmap.rep.anon.32
           1:   8334: TIME ml=                   17 
           2:  16297: TIME ml=                   27 
           3:  24106: TIME ml=                   32 
....
          94: 182778: TIME ml=                   41 
          95: 190278: TIME ml=                   46 
          96: 197778: TIME ml=                   44 
          97: 205278: TIME ml=                 4239 
          98: 194125: TIME ml=                   45 
          99: 201625: TIME ml=                   44 
         100: 174125: TIME ml=                   36 
         101: 181625: TIME ml=                   42 
         102: 189125: TIME ml=                 1231 
         103: 196625: TIME ml=                12163 
         104: 204125: TIME ml=                11591 
         105: 193077: TIME ml=                   47 
         106: 200577: TIME ml=                 2269 
         107: 173077: TIME ml=                   38 
         108: 180577: TIME ml=                   46 
         109: 188077: TIME ml=                 8337 
MMAP ERROR - 12:Cannot allocate memory
</32bit>

<64bit>
[root@v40z]# ./mmap.rep.anon
           1:   8334: TIME ml=                   25 
           2:  16297: TIME ml=                   29 
           3:  24106: TIME ml=                   35 
           4:  31833: TIME ml=                   38 
           5:  39509: TIME ml=                   40 
...
         127: 197975: TIME ml=                   33 
         128: 205475: TIME ml=                   34 
         129: 177975: TIME ml=                   37 
         130: 185475: TIME ml=                   37 
         131: 192975: TIME ml=                   38 
         132: 200475: TIME ml=                   39 
         133: 189762: TIME ml=                   40 
Segmentation fault
</64bit>



This event sent from IssueTracker by fhirtz 
 issue 361893

Comment 31 Issue Tracker 2010-01-12 22:22:34 UTC
Event posted on 01-12-2010 05:22pm EST by fhirtz

Further note that 32bit RHEL4 on a 32bit OS fails in the same manner as the
32bit-on-64bit RHEL4 example earlier.


This event sent from IssueTracker by fhirtz 
 issue 361893

Comment 32 Issue Tracker 2010-01-12 22:58:58 UTC
Event posted on 01-12-2010 05:58pm EST by fhirtz

RHEL5 32bit, 32bit application run quick test graph.


This event sent from IssueTracker by fhirtz 
 issue 361893
it_file 295483

Comment 33 Issue Tracker 2010-01-14 18:16:37 UTC
Event posted on 01-14-2010 01:16pm EST by cevich

Attaching latest 'standardized' reproducer we're using on all
platforms/architectures.  It was modified from original to:

+ Run for 1000 iterations then stop
+ Output in CSV format
+ Summarize min/avg/max/stdev at end


This event sent from IssueTracker by cevich 
 issue 361893
it_file 302843

Comment 39 Larry Woodman 2010-01-29 20:46:35 UTC
Created attachment 387641 [details]
Patch sent to rhkernel-list to address this issue via /proc/sys/vm tunable


This patch was sent to rhkernel-list to address this issue.  If you set /proc/sys/vm/topdown_allocate_fast to 1 (non-zero), we will use a 32-bit topdown allocation algorithm that is optimized for CPU speed rather than for virtual address space fragmentation.

Larry Woodman

Comment 40 Issue Tracker 2010-01-29 21:28:39 UTC
Event posted on 01-29-2010 04:28pm EST by cevich

Attaching script you can use to "see" VMA density based on the contents
of /proc/pid/maps.  Given the number of VMAs involved here, it's not fast
enough to run in real-time.  I recommend using it to parse samples of
/proc/pid/maps captured over time.

As in the graphs I posted earlier, what you should see is more or less
uniform density across the address space.  It shouldn't start small and grow
over time as in the original RHEL 5.4 32/64 measurements.

Since the fix we're doing is x86_64 specific, this tool may come in handy
for examining behavior on other architectures (which can "do" memory
entirely differently).
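
The attached script isn't reproduced here, but the idea can be sketched (names and bucket size are my own, not the script's): bucket the VMA start addresses from a saved /proc/<pid>/maps sample and count VMAs per slice of address space.

```python
from collections import Counter

def vma_density(maps_text, bucket_bits=24):
    """Count VMAs per 2**bucket_bits-byte slice of the address space,
    given the text of a /proc/<pid>/maps snapshot (one
    'start-end perms offset dev inode [path]' line per VMA)."""
    buckets = Counter()
    for line in maps_text.splitlines():
        if not line.strip():
            continue
        start = int(line.split('-', 1)[0], 16)  # hex start address
        buckets[start >> bucket_bits] += 1
    return dict(buckets)

# Three VMAs: two in the low 16MiB slice, one near the top of 32-bit space.
sample = """00001000-006e3000 rw-p 00000000 00:00 0
006e3000-006fd000 r-xp 00000000 08:01 123 /lib/ld-2.5.so
f7f72000-f7f73000 rw-p 00000000 00:00 0"""
density = vma_density(sample)
```

A uniform density across buckets matches the healthy runs; density that starts small and grows matches the degrading RHEL 5.4 32-on-64 runs.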


This event sent from IssueTracker by cevich 
 issue 361893
it_file 348543

Comment 41 Chris Evich 2010-02-02 19:00:53 UTC
Created attachment 388364 [details]
modified reproducer for easier inclusion in automated testing

Reproducer based on the original reproducer supplied by the customer.  Modified to suppress output and stop testing after 10k iterations.  It exits non-zero if the test fails and zero if it passes.  Failure conditions are:

minimum mmap() call wall-time exceeds 100ms
or
maximum mmap() call wall-time exceeds 1000ms
or
Average mmap() call wall-time exceeds 500ms
or
Standard deviation in wall-time for mmap() call exceeds 10ms

Note: The timings will be tied to the CPU/memory speed of the system.  The key is the standard deviation check; the other tests are mostly just for sanity.  When mmap() call wall-time is low and consistent despite VM fragmentation, we see values around 1-5ms, even on low-end hardware.  In the worst case, when mmap() call wall-time is inconsistent, we see values over 20ms and often much higher (into the 100's of ms).
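
Those failure conditions amount to a simple threshold check over the per-call timings; a sketch of the logic (the function and constant names are mine, not the reproducer's):

```python
import statistics

# Failure thresholds (ms) as described in the comment above.
LIMITS = {"min": 100, "max": 1000, "avg": 500, "stdev": 10}

def mmap_timings_pass(times_ms):
    """Return True when min/avg/max/stdev of the mmap() wall-times (ms)
    are all within the thresholds; exceeding any one of them fails."""
    return (min(times_ms) <= LIMITS["min"]
            and max(times_ms) <= LIMITS["max"]
            and statistics.mean(times_ms) <= LIMITS["avg"]
            and statistics.pstdev(times_ms) <= LIMITS["stdev"])
```

Consistent low timings pass; even moderate values fail once their spread pushes the standard deviation over 10ms, which is the check that actually catches the fragmentation behavior.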

Comment 42 RHEL Program Management 2010-02-04 23:42:28 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 47 Chris Ward 2010-02-11 10:10:13 UTC
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this 
release that addresses your request. Please test and report back results 
here, by March 3rd 2010 (2010-03-03) or sooner.

Upon successful verification of this request, post your results and update 
the Verified field in Bugzilla with the appropriate value.

If you encounter any issues while testing, please describe them and set 
this bug into NEED_INFO. If you encounter new defects or have additional 
patch(es) to request for inclusion, please clone this bug per each request
and escalate through your support representative.

Comment 48 Jan Tluka 2010-02-24 14:00:57 UTC
I've tried to reproduce with 189.el5 kernel:

# cat /proc/sys/vm/topdown_allocate_fast 
0
# echo 1 >  /proc/sys/vm/topdown_allocate_fast 
# cat /proc/sys/vm/topdown_allocate_fast 
1
# time ./agua_frag_64 > agua_frag_64.log

real    0m16.691s
user    0m6.569s
sys     0m10.119s
[root@hp-dl360g6-02 ~]# time ./agua_frag_32 > agua_frag_32.log

This won't finish and following message floods console:

BUG: soft lockup - CPU#3 stuck for 10s! [agua_frag_32:5198]
CPU 3:
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec i2c_core dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport hpilo shpchp bnx2 serio_raw pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod cciss sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 5198, comm: agua_frag_32 Not tainted 2.6.18-189.el5 #1
RIP: 0010:[<ffffffff80010a73>]  [<ffffffff80010a73>] find_vma_prev+0x14/0x57
RSP: 0018:ffff810189863e20  EFLAGS: 00000287
RAX: ffff81031524ee48 RBX: ffff81031ef8a500 RCX: ffff810313683dc8
RDX: ffff810189863e58 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000000 R08: ffff810314091c38 R09: ffffffff80071894
R10: ffff810189862000 R11: 0000000000000000 R12: 0000000000000000
R13: ffff81031524ee48 R14: ffff81031524ee48 R15: ffff8103170b7ef8
FS:  0000000000000000(0000) GS:ffff81031ff676c0(0063) knlGS:00000000f7d896c0
CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 000000001bb3a128 CR3: 0000000314042000 CR4: 00000000000006e0

Call Trace:
 [<ffffffff80071a0d>] arch_get_unmapped_area_topdown_fast+0x179/0x21e
 [<ffffffff80016fb1>] vma_merge+0x1d8/0x1ee
 [<ffffffff800cf961>] get_unmapped_area_prot+0x8b/0x134
 [<ffffffff8000dcb2>] do_mmap_pgoff+0x1a0/0x7d7
 [<ffffffff800b85ce>] audit_syscall_entry+0x180/0x1b3
 [<ffffffff800655ab>] __down_write_nested+0x12/0x92
 [<ffffffff80085204>] sys32_mmap2+0x76/0xcb
 [<ffffffff8006249d>] sysenter_do_call+0x1e/0x76

I'll try with the latest RHEL5.5 kernel, -190.el5, but I don't think it will help.

Comment 49 Jan Tluka 2010-02-24 14:17:43 UTC
Additional info to my previous comment:

Running 32-bit application with topdown allocation optimized for
address space fragmentation, I get following result:

# cat /proc/sys/vm/topdown_allocate_fast 
0
# time ./agua_frag_32

-1,Min Time (ms),5
-2,Avg. Time (ms),116.0000
-3,Max Time (ms),233
-4,Std Dev (ms),48.0076

real    2m5.270s
user    0m0.913s
sys     2m4.337s

I don't see those lockup messages on console.

Comment 50 Chris Evich 2010-02-24 15:07:46 UTC
I've seen this as well, but when I tried it on a test box with more memory available, it ran fine.  This may be an actual bug; I'm not one to judge that.

IMO (need to double-check), the test requires 2-3 GiB of available memory.  Also, I've been setting

sysctl -w vm.max_map_count=1073741824

The default is 65536 and you can see it's allocating over 200k mmaps.  But the results from your run are way out of range as well:

-4,Std Dev (ms),48.0076

My 15gig box reports:

-4,Std Dev (ms),4.3147

(In reply to comment #48)
> [root@hp-dl360g6-02 ~]# time ./agua_frag_32 > agua_frag_32.log
> 
> This won't finish and following message floods console:
> 
> BUG: soft lockup - CPU#3 stuck for 10s! [agua_frag_32:5198]
...
> Call Trace:
>  [<ffffffff80071a0d>] arch_get_unmapped_area_topdown_fast+0x179/0x21e
>  [<ffffffff80016fb1>] vma_merge+0x1d8/0x1ee
>  [<ffffffff800cf961>] get_unmapped_area_prot+0x8b/0x134
>  [<ffffffff8000dcb2>] do_mmap_pgoff+0x1a0/0x7d7
>  [<ffffffff800b85ce>] audit_syscall_entry+0x180/0x1b3
>  [<ffffffff800655ab>] __down_write_nested+0x12/0x92
>  [<ffffffff80085204>] sys32_mmap2+0x76/0xcb
>  [<ffffffff8006249d>] sysenter_do_call+0x1e/0x76

Comment 51 Chris Evich 2010-02-24 15:16:12 UTC
followup:

# ./agua_frag_32
...
319,196657,39 
320,204157,40 
(ctrl-z)
[1]+  Stopped                 ./agua_frag_32
# grep Vm /proc/`pidof agua_frag_32`/status
VmPeak:  2768800 kB
VmSize:  2487400 kB
VmLck:         0 kB
VmHWM:      5340 kB
VmRSS:      5340 kB
VmData:  2485744 kB
VmStk:        84 kB
VmExe:         8 kB
VmLib:      1528 kB
VmPTE:        40 kB
# wc -l /proc/`pidof agua_frag_32`/maps
58668 maps

So, question of a bug is: if you run out of memory, should the system stall like that inside arch_get_unmapped_area_topdown_fast()?

Comment 52 Jan Tluka 2010-02-24 15:30:39 UTC
I'm running on following configuration:

# free -m
             total       used       free     shared    buffers     cached
Mem:         12008        492      11516          0         48        271
-/+ buffers/cache:        171      11837
Swap:        14015          0      14015

Comment 53 Jan Tluka 2010-02-24 15:37:25 UTC
Setting vm.max_map_count to 1073741824 solves this problem. Tried several times.

# sysctl -w vm.max_map_count=1073741824
# ./agua_frag_32
-1,Min Time (ms),5
-2,Avg. Time (ms),12.0000
-3,Max Time (ms),15
-4,Std Dev (ms),1.6643

Comment 54 Jan Tluka 2010-02-24 15:43:18 UTC
(In reply to comment #51)

> So, question of a bug is: if you run out of memory, should the system stall
> like that inside arch_get_unmapped_area_topdown_fast()?    

From the user's point of view, the system becomes unresponsive (with the default config). So I'd suggest filing a new bug targeting this problem.

Comment 55 Jan Tluka 2010-02-24 16:10:39 UTC
Flipping back to ON_QA. Setting needinfo on Larry, to make a decision about opening new bug.

Comment 56 Chris Evich 2010-02-24 16:21:03 UTC
(In reply to comment #54)
> From the user point of view, system becomes unresponsive (with the default
> config). So I'd suggest to file a new bug targeting this problem.    

Not true for me, I do _not_ get the softlock when running w/o tunings (i.e. w/o the fix for this bug):

# echo 0 > /proc/sys/vm/topdown_allocate_fast 
# echo 65536 > /proc/sys/vm/max_map_count 
# ./agua_frag_32 > agua_frag_32.log
# tail -4 agua_frag_32.log 
-1,Min Time (ms),10
-2,Avg. Time (ms),388.0000
-3,Max Time (ms),806
-4,Std Dev (ms),165.2773

I only see the softlock (almost immediately) with vm.max_map_count=65536 and vm.topdown_allocate_fast=1:

Feb 24 11:16:43 sun-x4200-1 kernel: BUG: soft lockup - CPU#3 stuck for 10s! [agua_frag_32:23102]
...
Feb 24 11:16:43 sun-x4200-1 kernel:  [<ffffffff80070988>] arch_get_unmapped_area_topdown_fast+0x179/0x21e

The softlock is happening w/in the function we added to fix this bug, which is activated by vm.topdown_allocate_fast=1.

Same bug...new bug doesn't matter to me, but clearly there's a behavior change.

Comment 58 Jan Tluka 2010-02-24 16:30:01 UTC
Yes, you're right. The default is 0 (optimized for fragmentation), and with this setting I do not see the lockups. Sorry for the confusion. So a user that does not need/use these tunings is not affected.

Comment 59 Dave Anderson 2010-02-25 14:01:40 UTC
I have an RHTS machine with the latest kernel installed (190.el5),

  # cat /proc/sys/vm/topdown_allocate_fast
  1
  # cat /proc/sys/vm/max_map_count
  65536
  #

and it either fails relatively quickly like this:

  # ./agua_frag_test_32 > agua_frag_32.log
  .MMAP ERROR - 12:Cannot allocate memory
  # ./agua_frag_test_32 > agua_frag_32.log
  .MMAP ERROR - 12:Cannot allocate memory
  # 

but now on the 3rd attempt (preceded by time) it has hung:

  # time ./agua_frag_test_32 > agua_frag_32.log
  .
  
I instrumented find_vma_prev() to determine whether it's
stuck in the "while (rb_node)" loop -- but it's not.  So
it must be coming back into find_vma_prev() some other way
without leaving the kernel.

Dave

Comment 60 Dave Anderson 2010-02-25 17:02:09 UTC
OK, here's what's happening:

Upon the 1204975th call to arch_get_unmapped_area_topdown_fast(),
the task gets stuck in the function's do-while loop:

arch_get_unmapped_area_topdown_fast(struct file *filp, const unsigned long addr0,
		const unsigned long len, const unsigned long pgoff,
		const unsigned long flags)
{
	struct vm_area_struct *vma, *prev_vma;
	struct mm_struct *mm = current->mm;
	unsigned long base = mm->mmap_base, addr = addr0;
	int first_time = 1;
	unsigned long begin, end;
	
	find_start_end(flags, &begin, &end); 

	/* requested length too big for entire address space */
	if (len > TASK_SIZE)
		return -ENOMEM;

	/* dont allow allocations above current base */
	if (mm->free_area_cache > base)
		mm->free_area_cache = base;

	if (flags & MAP_FIXED)
		return addr;

	/* requesting a specific address */
	if (addr) {
		addr = PAGE_ALIGN(addr);
		vma = find_vma(mm, addr);
		if (TASK_SIZE - len >= addr &&
				(!vma || addr + len <= vma->vm_start))
			return addr;
	}

try_again:
	/* make sure it can fit in the remaining address space */
	if (mm->free_area_cache < len)
		goto bottomup;

	/* either no address requested or cant fit in requested address hole */
	addr = (mm->free_area_cache - len) & PAGE_MASK;
	do {
		/*
		 * Lookup failure means no vma is above this address,
		 * i.e return with success:
		 */
 	 	if (!(vma = find_vma_prev(mm, addr, &prev_vma)))
			return addr;

		/*
		 * new region fits between prev_vma->vm_end and
		 * vma->vm_start, use it:
		 */
		if (addr && addr+len <= vma->vm_start &&
				(!prev_vma || (addr >= prev_vma->vm_end)))
			/* remember the address as a hint for next time */
			return (mm->free_area_cache = addr);
		else
			/* pull free_area_cache down to the first hole */
			if (mm->free_area_cache == vma->vm_end)
				mm->free_area_cache = vma->vm_start;

		/* try just below the current vma->vm_start */
		addr = vma->vm_start-len;
	} while (len <= vma->vm_start);
 
...

(1) The function's entry address parameter "addr0" was 0 and "len" is 0x1000.
(2) Entering the do-while loop, the local "addr" value was still 0.
(3) At the bottom of the function, each time around the loop: 
    -  "addr" remains 0, because...
    -  "vma" is 0xffff810219400608, which is the first VMA in the task's
       address space, which has a vm->vm_start of 0x1000. 
(4) So "len" is always <= vma->vm_start -- hence the hang...
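
The non-termination can be modeled in a few lines (a simplified sketch of the do-while above; the free_area_cache pull-down branch is elided and the names are mine):

```python
PAGE = 0x1000

def topdown_fast_loop(vmas, free_area_cache, length, max_iters=100):
    """Simplified model of the do-while loop in
    arch_get_unmapped_area_topdown_fast().  vmas is a sorted list of
    (vm_start, vm_end) tuples.  Returns an address on success,
    "bottomup" when the loop exits to the bottom-up fallback, or
    "hang" when addr stops making progress (the reported lockup)."""
    def find_vma(addr):
        for start, end in vmas:        # first VMA with vm_end > addr
            if end > addr:
                return (start, end)
        return None

    addr = (free_area_cache - length) & ~(PAGE - 1)
    for _ in range(max_iters):
        vma = find_vma(addr)
        if vma is None:
            return addr                # no VMA above addr: success
        start, _end = vma
        if addr and addr + length <= start:
            return addr                # hole below vma fits: success
        addr = start - length          # try just below this vma
        if length > start:             # do-while exit condition
            return "bottomup"
    return "hang"                      # addr never changed: stuck

# First VMA starts at 0x1000 and len == 0x1000, so addr pins to 0 and
# "len <= vma->vm_start" stays true forever -- the reported hang:
result = topdown_fast_loop([(0x1000, 0x6e3000)], 0x1000, PAGE)
```

With the first VMA at 0x1000 and a one-page request, addr computes to 0 on every pass and the loop condition never goes false; any VMA layout with a hole, by contrast, terminates normally.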

The task has 59208 VMAs, and looks like this:

  crash> vm
  PID: 3484   TASK: ffff81042feaa840  CPU: 2   COMMAND: "agua_frag_test_"
         MM               PGD          RSS    TOTAL_VM
  ffff81042defa840  ffff81042adcb000  5284k   3243892k
        VMA           START       END     FLAGS FILE
  ffff810219400608       1000     6e3000     71 
  ffff81022d4c0978     6e3000     6fd000    875 /lib/ld-2.5.so
  ffff81022d4c0ce8     6fd000     6fe000 100871 /lib/ld-2.5.so
  ffff81022d4c0d98     6fe000     6ff000 100873 /lib/ld-2.5.so
  ffff81022d4c0608     701000     840000     75 /lib/libc-2.5.so
  ffff81022d4c0558     840000     841000     70 /lib/libc-2.5.so
  ffff81022d4c06b8     841000     843000 100071 /lib/libc-2.5.so
  ffff81022d4c08c8     843000     844000 100073 /lib/libc-2.5.so
  ffff81022d4c0ad8     844000     847000 100073 
  ffff81022d4c0ef8     84f000     874000     75 /lib/libm-2.5.so
  ffff81022d4c03f8     874000     875000 100071 /lib/libm-2.5.so
  ffff81022d4c0768     875000     876000 100073 /lib/libm-2.5.so
  ffff810219a404a8     882000    565e000     71 
  ffff8102196e0298    565f000    5681000     71 
  ffff8102196e0b88    5688000    56a4000     71 
  ffff810219c98c38    56a5000    56cd000     71 
  ffff810219c98d98    56d4000    56ea000     71
  ... [ cut ] ...
  ffff81021a2ed3f8   f7f72000   f7f73000     71 
  ffff810219474978   f7f74000   f7f75000     71 
  ffff81021954b3f8   f7f76000   f7f88000     71 
  ffff8102196dcd98   f7f89000   f7f91000     71 
  ffff810219cfaef8   f7f92000   f7f95000     71 
  ffff810219b43ad8   f7f97000   f7fab000     71 
  ffff81022d4c01e8   f7fab000   f7fad000 100073 
  ffff81021a304ef8   f7fad000   f7fbd000     71 
  ffff81042e9fc088   ffd50000   ffd65000 100173 
  ffff81022d4c0818   ffffe000   fffff000 8040075
  crash>

Comment 65 Dave Anderson 2010-02-26 16:47:09 UTC
Add-on patch posted:

[RHEL5.5 PATCH] BZ #544448: update to "prevent severe performance degradation" patch
http://post-office.corp.redhat.com/archives/rhkernel-list/2010-February/msg01374.html

Comment 68 Jarod Wilson 2010-03-03 15:43:58 UTC
in kernel-2.6.18-191.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Please update the appropriate value in the Verified field
(cf_verified) to indicate this fix has been successfully
verified. Include a comment with verification details.

Comment 70 errata-xmlrpc 2010-03-30 06:55:38 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html