Bug 235164 - [PATCH] System hang caused by endless loop in create_buffers()
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: mm
Version: 3.8
Hardware: x86_64 Linux
Priority: medium  Severity: high
Assigned To: Nalin Dahyabhai
QA Contact: Brian Brock
Reported: 2007-04-04 01:41 EDT by Ryutaro Hayashi
Modified: 2007-11-16 20:14 EST
Doc Type: Bug Fix
Last Closed: 2007-10-19 14:37:38 EDT

Attachments
The patch file for slab_usable_pages() (557 bytes, patch)
2007-04-04 01:41 EDT, Ryutaro Hayashi

Description Ryutaro Hayashi 2007-04-04 01:41:26 EDT
Description of problem:

Machine: SunFire X4200
CPU    : Dual-Core AMD Opteron(tm) Processor 2220 SE X 2
OS     : RHEL3U8
Memory : 32GB

Our customer ran into a system hang caused by an endless loop in create_buffers().
Fortunately, the system has an NMI switch, so we were able to get a crash dump
after the hang. We then analyzed the dump.

The issue happens when the system is short of memory. kswapd tries to free
pages through do_try_to_free_pages_kswapd(), which calls rebalance_dirty_zone().
However, if there are no free pages for the buffer_heads needed to write
back dirty pages, the system runs into an endless loop in create_buffers().

The flow is shown below:

------------------------------------------------------------

  kswapd()
  --> do_try_to_free_pages_kswapd()
      --> rebalance_dirty_zone()
          max_loop = zone->inactive_dirty_pages;
          if (max_loop < BATCH_WORK_AMOUNT)
          	max_loop = BATCH_WORK_AMOUNT;
          lru_lock(zone);
          while (max_loop-- && !list_empty(&zone->inactive_dirty_list))
          {
            --> launder_page()
                ...
                  --> create_buffers()
                      ...
                      try_again:                       <-----------+
                        bh = get_unused_buffer_head(async);        |
                        if (!bh)                                   |
                        	goto no_grow;          --------+   |
                        ...                                    |   |
                      no_grow:                         <-------+   |
                        free_more_memory()                         |
                        goto try_again;                ------------+
          }
          lru_unlock(zone);


------------------------------------------------------------

This issue was supposed to be fixed in launder_page() in RHEL3U6.
I am not sure of the Bug ID. After that fix, launder_page() calls
swap_writepage() only if the page already has buffers or there are free
pages for buffer_heads, so the issue appeared to be gone.

RHEL3U5:
          if ((gfp_mask & __GFP_FS) && writepage) {
                  ClearPageDirty(page);
                  SetPageLaunder(page);
                  lru_unlock(zone);

                  writepage(page);

RHEL3U6 or later:

           if ((gfp_mask & __GFP_FS) && writepage &&
                           (page->buffers || slab_usable_pages(zone))) {
                                             ^^^^^^^^^^^^^^^^^^^^^^^ check free pages for buffer_head
                   ClearPageDirty(page);
                   SetPageLaunder(page);
                   lru_unlock(zone);

                   writepage(page);


However, the issue still happens because slab_usable_pages() checks the
free pages of the wrong zone.


ROOT CAUSE
==========

slab_usable_pages() checks free_pages of the DMA zone, but the free pages of
the Normal zone should be checked. The routine uses the macro ZONE_NORMAL (1)
as the index into node_zonelists[], and node_zonelists[1] is the "DMA" zone.
The index must be zero in order to point to the Normal zone, since
node_zonelists[0] is the Normal zone.

Because of this, launder_page() thinks there are free pages, since the DMA
zone has free pages, even though those pages cannot be used for buffer_heads.
It then starts to swap out even when there is no free page for buffer_heads in
the Normal zone, which results in an infinite loop in create_buffers() and a
system hang.

If slab_usable_pages() checked the free pages of the Normal zone correctly,
swap_writepage() would never be called when there is no free page, and kswapd
would continue to free pages. The system hang would not happen in this condition.

Here is the code that checks free_pages of the DMA zone by mistake.

mm/vmscan.c (RHEL3U8):

static int slab_usable_pages(zone_t * inzone)
{
        pg_data_t *pgdat;
        zonelist_t *zonelist;
        zone_t **zone;

        /* fast path to prevent looking at other zones */
#if defined(CONFIG_IA64) || !defined(CONFIG_HIGHMEM)
        if (inzone->free_pages)
                return 1;
#else
        if (inzone->zone_pgdat->node_zones[ZONE_NORMAL].free_pages)
                return 1;
#endif
        if (inzone - inzone->zone_pgdat->node_zones <= ZONE_NORMAL &&
            inzone->free_pages)
                return 1;

        /* slow path */
        for_each_pgdat(pgdat) {
                zonelist = pgdat->node_zonelists +
#if defined(CONFIG_IA64)
                        ZONE_HIGHMEM;
#else
                        ZONE_NORMAL;  <--- This is wrong. Should be zero.
#endif
                zone = zonelist->zones;
                if (*zone) {
                        for (;;) {
                                zone_t *z = *(zone++);
                                if (!z)
                                        break;
                                if (z->free_pages)
                                        return 1;
                        }
                }
        }
        return 0;
}

include/linux/mmzone.h:

#define ZONE_DMA                0
#define ZONE_NORMAL             1
#define ZONE_HIGHMEM            2
#define MAX_NR_ZONES            3



Finally, here is the result of the dump analysis.


crash> bt
PID: 11     TASK: 10037f34000       CPU: 1   COMMAND: "kswapd"
 #0 [1081fffed90] disk_dump at ffffffffa00a6ee4
 #1 [1081fffee28] .text.lock.sched at ffffffff80122283
 #2 [1081fffee80] try_crashdump at ffffffff80124c9f
 #3 [1081fffee90] die_nmi at ffffffff80112605
 #4 [1081fffeeb0] default_do_nmi at ffffffff80112705
 #5 [1081fffef50] nmi at ffffffff80110f44    <------------- NMI switch pressed.
    [exception RIP: .text.lock.sched+23]
    RIP: ffffffff80122283  RSP: 0000010037f35c98  RFLAGS: 00000086
    RAX: 0000000000000001  RBX: 0000010037f30000  RCX: 0000000000002580
    RDX: 0000000000000000  RSI: 0000000000000001  RDI: 0000010037f30000
    RBP: 0000010037f35cc8   R8: 000000000000000b   R9: 000001000002b8b0
    R10: 0000000000000000  R11: 000001000002ac80  R12: ffffffff805f1800
    R13: 0000000000000000  R14: 0000010037f35c98  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <exception stack> ---
 #6 [10037f35c98] .text.lock.sched at ffffffff80122283
 #7 [10037f35cd0] __wake_up at ffffffff801206ef
 #8 [10037f35d20] free_more_memory at ffffffff801627bd
 #9 [10037f35d30] create_buffers at ffffffff8016329b    <------- loop in create_buffers
#10 [10037f35d60] create_empty_buffers at ffffffff80163478
#11 [10037f35d80] brw_page at ffffffff801650ad
#12 [10037f35dd0] rw_swap_page_base at ffffffff80155d67
#13 [10037f35e50] rw_swap_page at ffffffff80155dba
#14 [10037f35e60] swap_writepage at ffffffff801579a4
#15 [10037f35e70] launder_page at ffffffff80150a70
#16 [10037f35eb0] rebalance_dirty_zone at ffffffff801540c0
#17 [10037f35ef0] do_try_to_free_pages_kswapd at ffffffff80154834
#18 [10037f35f20] kswapd at ffffffff80154e52
#19 [10037f35f50] kernel_thread at ffffffff80110d11

### page->buffers is NULL (this page does not have a buffer_head)
crash> struct page.buffers 10012efc1e8
  buffers = 0x0

### The number of unused buffer_heads is zero.
crash> rd nr_unused_buffer_heads
ffffffff805210b8:  0000000000000000                    ........

### There are no free pages except in the DMA zone.
crash> kmem -f
NODE
  0
ZONE  NAME        SIZE    FREE      MEM_MAP       START_PADDR  START_MAPNR
  0   DMA         4096    2459    10001000040          0          161320  
                          ^^^^
ZONE  NAME        SIZE    FREE      MEM_MAP       START_PADDR  START_MAPNR
  1   Normal    4321279       0    10001068040       1000000       165416  

ZONE  NAME        SIZE    FREE      MEM_MAP       START_PADDR  START_MAPNR
  2   HighMem        0       0         0               0            0     

--------------------------------------------------------------------------

NODE
  1
ZONE  NAME        SIZE    FREE      MEM_MAP       START_PADDR  START_MAPNR
  0   DMA            0       0         0               0            0     

ZONE  NAME        SIZE    FREE      MEM_MAP       START_PADDR  START_MAPNR
  1   Normal    4194303       0    10420083030      420000000    170358430 

ZONE  NAME        SIZE    FREE      MEM_MAP       START_PADDR  START_MAPNR
  2   HighMem        0       0         0               0            0     

nr_free_pages: 2459  (verified)

Since the DMA zone has free pages, as shown above, launder_page() calls
swap_writepage(), and this causes the system hang.


Version-Release number of selected component (if applicable):


How reproducible:

The customer has a test program that can reproduce this issue, but we cannot
obtain it for license reasons.

Steps to Reproduce:
1. Run a memory test program and wait one or two days.
  
Actual results:


Expected results:


Additional info:

The patch for this issue is attached.
Comment 1 Ryutaro Hayashi 2007-04-04 01:41:26 EDT
Created attachment 151643
The patch file for slab_usable_pages()
Comment 2 RHEL Product and Program Management 2007-10-19 14:37:38 EDT
This bug is filed against RHEL 3, which is in maintenance phase.
During the maintenance phase, only security errata and select mission
critical bug fixes will be released for enterprise products. Since
this bug does not meet that criteria, it is now being closed.
 
For more information of the RHEL errata support policy, please visit:
http://www.redhat.com/security/updates/errata/
 
If you feel this bug is indeed mission critical, please contact your
support representative. You may be asked to provide detailed
information on how this bug is affecting you.
