Bug 145818 - Memory deadlock when creating large file on iSCSI based ext3 filesystem and machine with 1GB ram
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: i386
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Mike Christie
QA Contact:
URL:
Whiteboard: RHEL3U7NAK
Duplicates: 150480
Depends On:
Blocks:
 
Reported: 2005-01-21 21:36 UTC by Dave Wysochanski
Modified: 2007-11-30 22:07 UTC
CC: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-09-14 14:00:34 UTC
Target Upstream Version:
Embargoed:


Attachments
Output from sysrq-m and sysrq-t for the hang (34.80 KB, text/plain)
2005-01-21 21:41 UTC, Dave Wysochanski
Messages file from machine that failed with deadlock (232.29 KB, text/plain)
2005-01-21 22:00 UTC, Dave Wysochanski
sysrq-t and sysrq-m for most common machine state (32.87 KB, text/plain)
2005-01-25 23:31 UTC, Dave Wysochanski
system info describing the hangs on i386 and ia-64 with 20+ luns dt test (7.97 MB, application/x-gtar)
2005-07-06 19:53 UTC, Dave Wysochanski
i386 dt version (395.85 KB, application/octet-stream)
2005-07-06 20:29 UTC, Dave Wysochanski
ia-64 dt binary (395.85 KB, application/octet-stream)
2005-07-06 20:29 UTC, Dave Wysochanski

Description Dave Wysochanski 2005-01-21 21:36:19 UTC
Description of problem:
Memory deadlock when creating large file on system with 1GB ram.  The
1GB of ram does seem to be significant.  If I boot the same machine
with "mem=512M", this problem does not seem to occur.  Also, the
problem does not seem to occur on 2GB or 4GB systems, at least from
what we can tell.  This may just mean it's less reproducible on these
systems, but we've only ever seen it on 1GB i386 systems.

Version-Release number of selected component (if applicable):
linux-iscsi-3.6.2

How reproducible:
Happens every time, within a minute or so.

Steps to Reproduce:
1. Create a single 10GB LUN
2. Map to linux box, start iscsi
3. Create 1 partition that covers the whole disk using fdisk
4. Run "mkfs -t ext3" on the partition created in step 3, and mount
the lun at /iscsi_mnt/large_luns/10-1/
5. Run the following 'dd' command, which should create a single 5GB
file on the filesystem:
dd if=/dev/zero of=/iscsi_mnt/large_luns/10-1/file bs=1024 count=50000000
  
Actual results:
Machine lockup as a result of an OOM condition, with messages similar
to the following on the console, indicating an ENOBUFS error from a
socket creation call:
iSCSI: session f79e0000 failed to create socket, rc -105
iSCSI: session f79e0000 connect failed at 259549
iSCSI: session f79e0000 to iqn.1992-08.com.netapp:sn.50393227.yanni
waiting 1 seconds before next login attempt
iSCSI: bus 0 target 1 trying to establish session f79e0000 to portal
1, address 10.60.155.93 port 3260 group 1
iSCSI: session f79e0000 failed to create socket, rc -105
iSCSI: session f79e0000 connect failed at 259655

Expected results:
Test to run to completion, without any lockups.

Additional info:
I'm not sure if this is specific to this Red Hat release, or is a
broader problem in the linux-iscsi-3.6.2 code.  I have not filed a bug
with sourceforge, but can do so provided we can repro this problem on
another system with a different kernel (maybe RHAS 2.1).  I'm also not
sure if this is an i386-specific problem, but so far this is the only
arch we've run on.

Comment 1 Dave Wysochanski 2005-01-21 21:41:27 UTC
Created attachment 110079 [details]
Output from sysrq-m and sysrq-t for the hang

I can probably get crashdump info if you need it.  I don't have this setup
currently.  Let me know if you want more info.

This is really easy to reproduce though, so you can probably do it on your end
easily.

Comment 2 Dave Wysochanski 2005-01-21 22:00:58 UTC
Created attachment 110080 [details]
Messages file from machine that failed with deadlock

This file contains all the messages from today, which includes a few of these
failures.  For the latest failure I reported this bug against, start from the
latest log messages and work backwards.

Comment 3 Dave Wysochanski 2005-01-22 00:17:27 UTC
I just tried this on RHAS 2.1 with 2.4.9-e.34smp and linux-iscsi-3.6.2
stock from sourceforge and it ran fine.

There was one typo in the original dd command (one too many '0's).
The correct command should be:
dd if=/dev/zero of=/iscsi_mnt/large_luns/10-1/file bs=1024 count=5000000

Comment 4 Dave Wysochanski 2005-01-23 16:23:40 UTC
Guys,

Is it possible that the 'dd' command above dirties so many pages that
the networking stack can't make forward progress?  Is there any way to
throttle the buffer cache's use of pages, e.g. limit the # of pages
that can be dirtied at one time?

Maybe I'm way off here.  I'm just not sure how to proceed, and this
looks like a pretty serious problem that's easily reproducible on
RHEL3 U4's linux-iscsi.

Comment 5 Dave Wysochanski 2005-01-23 17:17:48 UTC
It's really odd to me that this problem doesn't occur when I boot with
"mem=512M".  When I look at the free pages via sysrq-m, they seem to
be about the same in the 512M case as in the 1G case (around 4K pages
free), but in the 1G case the machine is frozen, unable to make
progress, while in the 512M case the machine is fine.

Note that the iscsi driver does not set "highmem_io", so it does not
call kmap to use highmem pages.
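
For context, in the 2.4 scsi midlayer "highmem_io" is a flag in the
driver's host template.  A hedged sketch of what setting it would look
like (illustrative only, not the actual iscsi_sfnet source; comment 23
below explains why simply turning it on was not safe for a non-PCI
driver at the time):

#include "hosts.h"   /* 2.4 drivers/scsi header defining Scsi_Host_Template */

static Scsi_Host_Template iscsi_template_sketch = {
        .name       = "iscsi-sfnet",    /* illustrative name */
        .highmem_io = 1,                /* tells the block layer this driver
                                           can handle highmem pages, so the
                                           request path needs no bounce page */
};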



Comment 6 Dave Wysochanski 2005-01-24 20:56:37 UTC
I just ran the Update2 kernel, and I don't see the deadlock.  I ran
the test multiple times.  At one point, I saw a few messages on the
console indicating iscsi was seeing memory pressure, and the host was a
bit unresponsive, but it was nothing like the Update4 kernel behavior.

The Update4 kernel deadlocks every time I run the test, within a minute
or so.

I compared the settings in /proc/sys/vm, and see that the pagecache
entry was changed from update2 to update4 as follows:
- update2: 1  15  100
- update4: 1  15   30

I tried changing the update4 values back to the update2 values and
re-running the test, but the change didn't seem to make much of a
difference.

Is there a more fundamental vm difference between update2 and update4?

Comment 7 Tom Coughlan 2005-01-25 02:57:01 UTC
I have been attempting to reproduce this on U4. I believe that I did
see exactly the same problem once.  On my other attempts I saw system
pauses and messages about connections lost/regained, but no lockup. I
will vary a few things and see if I can get it to reproduce more readily.

The sysrq-m data shows that there is some memory available that
probably should have been allocated for the iSCSI request in this
situation. I am working with the VM experts to figure out what is
going on.

There is not a way to throttle the buffer cache's use of pages. As I
understand it, such schemes cause more harm than good, as long as the
VM is working correctly.

Thanks for checking U2.  We will look at the U2/U4 diffs for hints.

Tom

Comment 8 Dave Wysochanski 2005-01-25 14:08:23 UTC
It seems like there would be a fundamental problem with iSCSI if
there's no way to throttle the page cache's use of memory.  What
guarantees that the networking stack can get the memory it needs to
make forward progress?

What prevents the following scenario?  Lots of pages have been
dirtied by something like the dd process writing to one or more files.
IO is given to the driver, and when it reaches the TCP layer, the
memory pressure causes an allocation for sk_buff memory to fail
(presumably because the allocation is GFP_ATOMIC -- I think this is
controlled via the sk->allocation flag) and bdflush to be woken up to
flush pages.  This in turn issues more IO to the iSCSI driver, which
is then blocked.
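
For reference, a minimal sketch of the 2.4-era send path this scenario
hinges on (reconstructed from memory of the stock sources, not the
verbatim RHEL3 code; the function itself is illustrative):

#include <linux/errno.h>
#include <linux/skbuff.h>
#include <net/sock.h>

static int tcp_send_sketch(struct sock *sk, int size)
{
        struct sk_buff *skb;

        /* tcp_sendmsg() allocates each segment with the socket's own
           allocation mask, sk->allocation */
        skb = alloc_skb(size, sk->allocation);
        if (skb == NULL) {
                /* the real code jumps to its wait_for_memory label and
                   sleeps in wait_for_tcp_memory() -- exactly where the
                   iscsi-tx thread sits in the trace in comment 15.  If
                   freeing memory requires completing writeback through
                   this same socket, that sleep never ends. */
                return -ENOBUFS;
        }
        /* ...build the segment and queue it for transmit... */
        kfree_skb(skb);
        return 0;
}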

I'm not sure this is precisely what is happening, and maybe my
ignorance of how the whole system works is causing me to ask this
question.  I do know though that the memory taken up by the page cache
constantly increases until it reaches very close to 1GB, then the
system freezes.

I also have a theory about why this problem is so reproducible in a
1GB configuration but isn't in a 512M configuration.  When the system
is locked up, the sysrq-t output often shows at least a few of the
threads trying to allocate a bounce_page (I presume because the iscsi
driver does not set highmem_io in its template when it registers with
the scsi layer, and thus does not use high memory).
This doesn't explain why we haven't seen it on a larger memory system,
but perhaps it's just harder to reproduce (maybe we need a larger LUN,
a longer dd sequence, etc).


Comment 9 Dave Wysochanski 2005-01-25 23:31:19 UTC
Created attachment 110231 [details]
sysrq-t and sysrq-m for most common machine state

Attachment shows the most common state I see the linux box in when this dd test
fails, for what it's worth.

Comment 10 Dave Wysochanski 2005-01-26 00:28:42 UTC
Ok, I found something that works, at least for this testcase.
The following settings in /proc/sys/vm/bdflush seem to cure this
livelock/deadlock:
"30 500 0 0 100 300 50 30"

The default settings are found in fs/buffer.c:
union bdflush_param {
        struct {
                int nfract;      /* Percentage of buffer cache dirty to
                                    activate bdflush */
                int ndirty;      /* Maximum number of dirty blocks to
                                    write out per wake-cycle */
                int dummy2;      /* old "nrefill" */
                int dummy3;      /* unused */
                int interval;    /* jiffies delay between kupdate flushes */
                int age_buffer;  /* Time for normal buffer to age before
                                    we flush it */
                int nfract_sync; /* Percentage of buffer cache dirty to
                                    activate bdflush synchronously */
                int nfract_stop_bdflush; /* Percentage of buffer cache dirty
                                            to stop bdflush */
                int dummy5;      /* unused */
        } b_un;
        unsigned int data[N_PARAM];
} bdf_prm = {{50, 500, 0, 0, 5*HZ, 30*HZ, 80, 50, 0}};


I'm not sure if there's one setting in particular that is more
significant than the others, whether these settings have other
implications, or whether this just makes the problem less probable
(most likely the case).  I was basically just playing around, trying to
come up with something that works as I read more about how all this
works.  My basic assumption was that the root cause of this problem is
too high a percentage of dirty pages in the page cache at any given
point in time, increasing the possibility that iscsi would need to
alloc a page that the vm system would choose to take from one of the
dirty iscsi page cache pages.
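
As an aside, a minimal userspace sketch of applying those values,
equivalent to echoing the string into /proc/sys/vm/bdflush (run as
root; the values are the ones quoted above):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        /* nfract=30, ndirty=500, interval=100, age_buffer=300,
           nfract_sync=50, nfract_stop_bdflush=30 */
        static const char settings[] = "30 500 0 0 100 300 50 30\n";
        int fd = open("/proc/sys/vm/bdflush", O_WRONLY);

        if (fd < 0)
                return 1;
        if (write(fd, settings, sizeof(settings) - 1) < 0) {
                close(fd);
                return 1;
        }
        return close(fd) != 0;
}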



Comment 11 Dave Wysochanski 2005-01-26 17:50:05 UTC
It's starting to look to me like kswapd is the problem, though
admittedly, this is all still pretty new to me.  I'm currently running
rhel3 update3 in a loop of the dd commands as follows:
while true; do dd if=/dev/zero of=/iscsi_mnt/large_luns/10-1/file bs=1024 count=5000000; done

What I notice is that although the system does not totally hang, like
in the rhel3 update4 case, there are times when it freezes and nothing
else runs.  At one of these times, I captured a sysrq-t trace, which
indicates kswapd is the only thread running, and the trace is
precisely the same trace as I see during most of the rhel3 update4
hangs I've encountered.  The trace is below.  It looks to me like it's
possible for kswapd to block waiting for memory while it's trying to
free memory, especially if it needs a bounce page.  There are the
emergency reserve pages that it can use, but perhaps these somehow get
completely depleted (by ???) so that kswapd, too, can no longer make
forward progress.



kswapd        R 00000001  3768    11      1            13    12 (L-TLB)
Call Trace:   [<c011d570>] smp_apic_timer_interrupt [kernel] 0x70
(0xc38b5de0)
[<c011d500>] smp_apic_timer_interrupt [kernel] 0x0 (0xc38b5df8)
[<c0125154>] sys_sched_yield [kernel] 0xa4 (0xc38b5e34)
[<c0160974>] alloc_bounce_page [kernel] 0x94 (0xc38b5e4c)
[<c0160aac>] create_bounce [kernel] 0x5c (0xc38b5e58)
[<c01cfd48>] locate_hd_struct [kernel] 0x38 (0xc38b5e5c)
[<c01d01f9>] __make_request [kernel] 0x99 (0xc38b5e9c)
[<c0226969>] __kfree_skb [kernel] 0x139 (0xc38b5eb8)
[<c01d093a>] generic_make_request [kernel] 0xea (0xc38b5ef8)
[<c01d09d9>] submit_bh_rsector [kernel] 0x49 (0xc38b5f20)
[<c0168a9f>] sync_page_buffers [kernel] 0x8f (0xc38b5f3c)
[<c0168bf2>] try_to_free_buffers [kernel] 0x142 (0xc38b5f50)
[<c01548f3>] launder_page [kernel] 0x873 (0xc38b5f68)
[<c0155edb>] rebalance_dirty_zone [kernel] 0x9b (0xc38b5f8c)
[<c01563d9>] do_try_to_free_pages_kswapd [kernel] 0x1b9 (0xc38b5fac)
[<c0156508>] kswapd [kernel] 0x68 (0xc38b5fd0)
[<c01564a0>] kswapd [kernel] 0x0 (0xc38b5fe4)
[<c01095ad>] kernel_thread_helper [kernel] 0x5 (0xc38b5ff0)



Comment 12 Dave Wysochanski 2005-01-27 00:26:47 UTC
Looks like I spoke too soon about Update2.  Update2 looks like it has
a similar problem, though it's much harder to actually lock up.  I see
similar backtraces with kswapd seemingly hung trying to free pages,
waiting for a page.

The latest info I have is at the time of the hang, it looks like
nr_emergency_pages is 0, and staying at zero.  It starts at 32 when
the system is running normally, then goes to 0 and stays there during
the hung machine state.

Somehow it sometimes gets out of this state (not sure how).  Next on
the list is to try and figure out who is consuming the emergency
pages, and how the deadlock/livelock is actually broken.

Any info on your end?

Comment 13 Dave Wysochanski 2005-01-27 15:48:50 UTC
The reason nr_emergency_pages is 0, and kswapd is blocked is because
there's no free pages left in ZONE_NORMAL.  Since there's no emergency
pages, and no ZONE_NORMAL pages, kswapd can't allocate a bounce buffer
to flush the bh on the page and free memory.

It looks like to me that the counts in the zone structure aren't being
updated correctly, because even after multiple 'sync' operations, and
even after the 'dd' process has exited, the inactive_dirty_pages and
inactive_laundry_pages counts are the highest of the counts, with the
free count near or equal to zero.  I could maybe understand
inactive_laundry_pages being high, but inactive_dirty_pages?  Only
after I unmount the iscsi device does the free count go back up, and
inactive_dirty_pages goes back to a small number or zero (NOTE: I'm
still not really sure what all these lists mean, but this looks wrong,
based on what I'd understand to be the most english speaking people's
common sense understanding of of "inactive" and "dirty".)

Comment 14 Larry Woodman 2005-01-27 19:57:21 UTC
kswapd is looping in alloc_bounce_page().  First it tries to
alloc_page() a lowmem bounce buffer, and that fails because the free
and inactive_clean lists are empty.  Next, alloc_bounce_page() loops
until there is an emergency bounce page on the list.  As soon as an IO
that had a bounce buffer completes, the buffer is placed on the
emergency list and kswapd grabs it and exits the loop in
alloc_bounce_page().  While kswapd is stuck in this loop, any other
caller of alloc_page() ends up waiting in wakeup_kswapd() because the
free list is so low.  Also, since everything is funneled behind kswapd
(i.e. waiting in wakeup_kswapd), the IOs without bounce pages that have
completed while kswapd is stuck in the alloc_bounce_page() loop stay on
the inactive_laundry_list.  It's kswapd that will post-process the IO
completions and move those pages from the inactive_laundry_list to the
inactive_clean_list and eventually the free_list.

So, I know why this is happening: kswapd and bdflushd are firing
asynchronous writes out so much faster than they complete that this
eventually drains the free_list, inactive_clean_list, and emergency
list -- lowmem pages for Normal zone pagecache pages, and bounce
buffers for highmem pagecache pages.  The question is why this only
happens when iscsi is used and not when local disks are used.  Is it
because iscsi can't keep up with kswapd and bdflushd where local disks
can?

I think I know how to fix this problem when it gets into this
situation: change the loop in alloc_bounce_page() to call
alloc_page(GFP_DMA) after the yield() call if kswapd is the caller, and
have kswapd back off on writepage calls when the system is choked in
this state.  That will allow only kswapd to use DMA pages to get out of
this state, and the back-off will keep it from getting back into this
state and eventually exhausting the DMA zone.  However, before I do
that I really need to know why we see 10000 or more bounce buffers
outstanding for significant periods of time.  And, BTW, the number of
page writes outstanding is MUCH larger than 10000.  Those bounce
buffers only represent highmem pagecache pages; there are ~240000
lowmem pages that don't even need bounce buffers!

In a nutshell, I think we are seeing IO completion happen much more
slowly when iscsi is in use than when local disks are in use.

Larry Woodman
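
For concreteness, a hedged sketch of the change Larry describes (the
helper names are illustrative and this is not the actual RHEL3 patch;
the existing loop structure follows the 2.4 mm/highmem.c code visible
in the traces above):

#include <linux/mm.h>
#include <linux/sched.h>

static struct page *grab_emergency_bounce_page(void);  /* hypothetical helper */
static int current_is_kswapd(void);                    /* hypothetical test */

static struct page *alloc_bounce_page_sketch(void)
{
        struct page *page;

        for (;;) {
                /* the existing loop: try a lowmem page, then the
                   emergency list of recycled bounce pages */
                page = alloc_page(GFP_NOHIGHIO);
                if (page)
                        return page;
                page = grab_emergency_bounce_page();
                if (page)
                        return page;
                yield();

                /* proposed addition: only kswapd may fall back to the
                   DMA zone, so the one thread that post-processes IO
                   completions can always make forward progress */
                if (current_is_kswapd()) {
                        page = alloc_page(GFP_DMA);
                        if (page)
                                return page;
                }
        }
}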
    

Comment 15 Dave Wysochanski 2005-01-27 21:29:12 UTC
I'm not sure the problem is really IO completing more slowly; perhaps
it is IO not completing at all.

For instance, in the traces I see where the system is completely
deadlocked, the iscsi-tx thread is blocked in the tcp layer waiting
for memory, presumably because memory is low, having been taken up by
the pagecache dirty buffers.

Could this be one possible scenario?  First, iscsi-tx blocks in TCP
waiting for memory to send IO.  (The reason for this is unclear to me,
but I'd guess if there's no limit to the memory the page cache can
consume, it's possible for that to be the reason.)  Next, kswapd is
woken up and tries to free memory.  But it keeps picking pages that
need to go through the iscsi device, which is blocked in TCP waiting
for memory.  Eventually kswapd and bdflushd get blocked as well.



iscsi-tx      S 00000001  3356  2583      1          2584  2582 (L-TLB)
Call Trace:   [<c0123f14>] schedule [kernel] 0x2f4 (0xf6ed3a84)
[<c0134fbc>] schedule_timeout [kernel] 0xbc (0xf6ed3ac8)
[<c0252003>] wait_for_tcp_memory [kernel] 0x1d3 (0xf6ed3b00)
[<c0227624>] alloc_skb [kernel] 0xc4 (0xf6ed3b48)
[<c0253c16>] tcp_sendmsg [kernel] 0xdb6 (0xf6ed3b60)
[<c0276692>] inet_sendmsg [kernel] 0x42 (0xf6ed3bd0)
[<c02237f8>] sock_sendmsg [kernel] 0x78 (0xf6ed3be4)
[<f8aee581>] iscsi_sendmsg [iscsi_sfnet] 0x61 (0xf6ed3c28)
[<f8af2cf5>] iscsi_xmit_task [iscsi_sfnet] 0x255 (0xf6ed3c40)
[<c024da3f>] ip_finish_output2 [kernel] 0xcf (0xf6ed3c60)
[<c02613cd>] tcp_send_probe0 [kernel] 0x1dd (0xf6ed3d74)
[<c025f783>] tcp_write_xmit [kernel] 0x1f3 (0xf6ed3d98)
[<c025333f>] tcp_sendmsg [kernel] 0x4df (0xf6ed3dc0)
[<c0276692>] inet_sendmsg [kernel] 0x42 (0xf6ed3dd4)
[<c02237f8>] sock_sendmsg [kernel] 0x78 (0xf6ed3de8)
[<c0276692>] inet_sendmsg [kernel] 0x42 (0xf6ed3e30)
[<c02237f8>] sock_sendmsg [kernel] 0x78 (0xf6ed3e44)
[<f8aee581>] iscsi_sendmsg [iscsi_sfnet] 0x61 (0xf6ed3e88)
[<f8af4c3d>] iscsi_xmit_ping [iscsi_sfnet] 0x15d (0xf6ed3ea0)
[<f8af378b>] iscsi_xmit_queued_cmnds [iscsi_sfnet] 0x20b (0xf6ed3f20)
[<f8aec870>] iscsi_task_times_out [iscsi_sfnet] 0x0 (0xf6ed3f28)




Comment 16 Tom Coughlan 2005-01-28 18:17:41 UTC
Dave, 

The first message in the log always seems to be:

iSCSI: 10 second timeout expired for session ...
iSCSI: session f5ef4000 xmit_data failed to send ... 
iSCSI: session f5ef4000 to iqn.1992-08.com.netapp:sn.33610450 dropped
iSCSI: bus 0 target 0 trying to establish session ...

I wonder if the iSCSI infinite command timeout is the cause of the
hang. Outstanding commands never time out and return resources like
they do with all other I/O devices.

My system gets pauses and errors but (almost) always recovers and
finishes the dd job, so I can't test this easily.  Can you try
setting a command timeout of 60 seconds (to match normal SCSI), to see
if it gets you out of the hang?

Tom


Comment 17 Dave Wysochanski 2005-01-28 20:09:39 UTC
Thanks for the suggestion, Tom.  However, I don't think the infinite
iscsi timeout is the cause of the hang.  At this point, I'm pretty
sure it's the page cache's ability to consume memory to the point that
it stops iscsi from making forward progress in the networking layer.
I think this is the cause of the "10 second timeout" message, but I
have yet to verify that explicitly.

I set DiskCommandTimeout=60 in /etc/iscsi.conf, and did see a few IO
failure messages.  However, the system still deadlocks, pretty much as
fast as before.

telnet> send break
SysRq : Show Memory
Mem-info:
Zone:DMA freepages:  2905 min:     0 low:     0 high:     0
Zone:Normal freepages: 66973 min:   767 low:  4032 high:  5792
Zone:HighMem freepages:   507 min:   255 low:   512 high:   768
Free pages:       70385 (   507 HighMem)
( Active: 8437/145829, inactive_laundry: 19032, inactive_clean: 98,
free: 70385 )
  aa:0 ac:0 id:0 il:0 ic:0 fr:2905
  aa:284 ac:2087 id:122376 il:17337 ic:34 fr:66973
  aa:1833 ac:4236 id:23453 il:1695 ic:64 fr:507
1*4kB 2*8kB 3*16kB 1*32kB 2*64kB 1*128kB 2*256kB 1*512kB 0*1024kB
1*2048kB 2*4096kB = 11620kB)
1*4kB 0*8kB 1*16kB 1*32kB 1*64kB 0*128kB 0*256kB 1*512kB 1*1024kB
0*2048kB 65*4096kB = 267892kB)
1*4kB 1*8kB 0*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 1*1024kB
0*2048kB 0*4096kB = 2028kB)
Swap cache: add 0, delete 0, find 0/0, race 0+0
9934 pages of slabcache
130 pages of kernel stacks
0 lowmem pagetables, 181 highmem pagetables
Free swap:       2104504kB
262144 pages of RAM
32768 pages of HIGHMEM
5001 reserved pages
172636 pages shared
0 pages swap cached

Red Hat Enterprise Linux AS release 3 (Taroon Update 4)
Kernel 2.4.21-15.ELsmp on an i686

tso login:
telnet> send break
SysRq : Show Memory
Mem-info:
Zone:DMA freepages:  2905 min:     0 low:     0 high:     0
Zone:Normal freepages:   788 min:   767 low:  4032 high:  5792
Zone:HighMem freepages:   291 min:   255 low:   512 high:   768
Free pages:        4030 (   304 HighMem)
( Active: 9390/178272, inactive_laundry: 49896, inactive_clean: 3442,
free: 4019 )
  aa:0 ac:0 id:0 il:0 ic:0 fr:2905
  aa:299 ac:2605 id:159024 il:44346 ic:3297 fr:798
  aa:2244 ac:4246 id:19228 il:5497 ic:266 fr:264
1*4kB 2*8kB 3*16kB 1*32kB 2*64kB 1*128kB 2*256kB 1*512kB 0*1024kB
1*2048kB 2*4096kB = 11620kB)
1*4kB 1*8kB 0*16kB 2*32kB 3*64kB 7*128kB 4*256kB 2*512kB 0*1024kB
0*2048kB 0*4096kB = 3212kB)
12*4kB 11*8kB 6*16kB 2*32kB 1*64kB 1*128kB 1*256kB 1*512kB 0*1024kB
0*2048kB 0*4096kB = 1256kB)
Swap cache: add 0, delete 0, find 0/0, race 0+0
8851 pages of slabcache
142 pages of kernel stacks
0 lowmem pagetables, 211 highmem pagetables
Free swap:       2104504kB
262144 pages of RAM
32768 pages of HIGHMEM
5001 reserved pages
190232 pages shared
0 pages swap cached
iSCSI: 10 second timeout expired for session f7994000, rx 232994, ping
233493, now 233994
iSCSI: session f7994000 xmit_data failed to send 65584 bytes, rc 42152
iSCSI: session f7994000 to iqn.1992-08.com.netapp:sn.66666666.rolex1
dropped
iSCSI: bus 0 target 2 trying to establish session f7994000 to portal
0, address 10.60.155.31 port 3260 group 1001
iSCSI: login phase for session f7994000 (rx 1803, tx 1802) timed out
at 235500, timeout was set for 235500
iSCSI: session f7994000 connect timed out at 235502
iSCSI: session f7994000 to iqn.1992-08.com.netapp:sn.66666666.rolex1
waiting 1 seconds before next login attempt
iSCSI: bus 0 target 2 trying to establish session f7994000 to portal
0, address 10.60.155.31 port 3260 group 1001
iSCSI: login phase for session f7994000 (rx 1803, tx 1802) timed out
at 237107, timeout was set for 237107
iSCSI: session f7994000 connect timed out at 237109
iSCSI: session f7994000 to iqn.1992-08.com.netapp:sn.66666666.rolex1
waiting 1 seconds before next login attempt
iSCSI: bus 0 target 2 trying to establish session f7994000 to portal
1, address 10.60.155.225 port 3260 group 1001
iSCSI: login phase for session f7994000 (rx 1803, tx 1802) timed out
at 238714, timeout was set for 238714
iSCSI: session f7994000 connect timed out at 238716
iSCSI: session f7994000 to iqn.1992-08.com.netapp:sn.66666666.rolex1
waiting 1 seconds before next login attempt
iSCSI: bus 0 target 2 trying to establish session f7994000 to portal
2, address 10.60.155.226 port 3260 group 1001
Neighbour table overflow.
iSCSI: session f7994000 to iqn.1992-08.com.netapp:sn.66666666.rolex1
failed to connect, rc -105
iSCSI: session f7994000 connect failed at 238823
iSCSI: session f7994000 iSCSI: session f7994000 retrying all the
portals again, since the portal list got exhausted
iSCSI: session f7994000 to iqn.1992-08.com.netapp:sn.66666666.rolex1
waiting 1 seconds before next login attempt
iSCSI: session f7994000 failing normal command f7b62a00 cdb 0x2a to (2
0 2 0) at 238903, retries 0, allowed 5
iSCSI: session f7994000 failing normal command f6f03800 cdb 0x2a to (2
0 2 0) at 238905, retries 0, allowed 5
iSCSI: session f7994000 failing normal command f7418200 cdb 0x2a to (2
0 2 0) at 238907, retries 0, allowed 5
iSCSI: session f7994000 failing normal command f7418400 cdb 0x2a to (2
0 2 0) at 238909, retries 0, allowed 5
iSCSI: session f7994000 failing normal command f6e6ec00 cdb 0x2a to (2
0 2 0) at 238911, retries 0, allowed 5
iSCSI: session f7994000 failing normal command f6e6e600 cdb 0x2a to (2
0 2 0) at 238913, retries 0, allowed 5
iSCSI: session f7994000 failing normal command f6e6ee00 cdb 0x2a to (2
0 2 0) at 238915, retries 0, allowed 5
iSCSI: session f7994000 failing normal command f6e6e800 cdb 0x2a to (2
0 2 0) at 238917, retries 0, allowed 5
iSCSI: session f7994000 failing normal command f6f03600 cdb 0x2a to (2
0 2 0) at 238919, retries 0, allowed 5
iSCSI: session f7994000 failing normal command f6e6ea00 cdb 0x2a to (2
0 2 0) at 238922, retries 0, allowed 5
iSCSI: session f7994000 failing normal command f6f03a00 cdb 0x2a to (2
0 2 0) at 238924, retries 0, allowed 5
iSCSI: session f7994000 failing normal command f6f03c00 cdb 0x2a to (2
0 2 0) at 238926, retries 0, allowed 5
iSCSI: bus 0 target 2 trying to establish session f7994000 to portal
0, address 10.60.155.31 port 3260 group 1001
 I/O error: dev 08:21, sector 4530984
 I/O error: dev 08:21, sector 4531240
 I/O error: dev 08:21, sector 4531496
 I/O error: dev 08:21, sector 4531752
 I/O error: dev 08:21, sector 4532008
 I/O error: dev 08:21, sector 4532264
 I/O error: dev 08:21, sector 4532520
 I/O error: dev 08:21, sector 4532776
 I/O error: dev 08:21, sector 4533032
 I/O error: dev 08:21, sector 4533280
 I/O error: dev 08:21, sector 4533536
 I/O error: dev 08:21, sector 4533792
iSCSI: bus 0 target 2 established session f7994000 #2, portal 0,
address 10.60.155.31 port 3260 group 1001

telnet> send break
iSCSI: 10 second timeout expired for session f7994000, rx 239006, ping
239505, now 240006
iSCSI: session f7994000 xmit_data failed to send 65584 bytes, rc 61792
iSCSI: session f7994000 to iqn.1992-08.com.netapp:sn.66666666.rolex1
dropped
iSCSI: bus 0 target 2 trying to establish session f7994000 to portal
0, address 10.60.155.31 port 3260 group 1001
SysRq : Show Memory
Mem-info:
Zone:DMA freepages:  2905 min:     0 low:     0 high:     0
Zone:Normal freepages:     0 min:   767 low:  4032 high:  5792
Zone:HighMem freepages:   301 min:   255 low:   512 high:   768
Free pages:        3206 (   301 HighMem)
( Active: 9456/168537, inactive_laundry: 58481, inactive_clean: 570,
free: 3206 )
  aa:0 ac:0 id:0 il:0 ic:0 fr:2905
  aa:299 ac:2668 id:158222 il:44412 ic:0 fr:0
  aa:2244 ac:4245 id:10315 il:14069 ic:570 fr:301
1*4kB 2*8kB 3*16kB 1*32kB 2*64kB 1*128kB 2*256kB 1*512kB 0*1024kB
1*2048kB 2*4096kB = 11620kB)
0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB
0*2048kB 0*4096kB = 0kB)
1*4kB 0*8kB 9*16kB 3*32kB 1*64kB 1*128kB 1*256kB 1*512kB 0*1024kB
0*2048kB 0*4096kB = 1204kB)
Swap cache: add 0, delete 0, find 0/0, race 0+0
8785 pages of slabcache
142 pages of kernel stacks
0 lowmem pagetables, 211 highmem pagetables
Free swap:       2104504kB
262144 pages of RAM
32768 pages of HIGHMEM
5001 reserved pages
194715 pages shared
0 pages swap cached
iSCSI: login phase for session f7994000 (rx 1803, tx 1802) timed out
at 241512, timeout was set for 241512
iSCSI: session f7994000 connect timed out at 241514
iSCSI: session f7994000 to iqn.1992-08.com.netapp:sn.66666666.rolex1
waiting 1 seconds before next login attempt
iSCSI: bus 0 target 2 trying to establish session f7994000 to portal
0, address 10.60.155.31 port 3260 group 1001
iSCSI: session f7994000 recv_login_pdu: connection closed
iSCSI: session f7994000 login I/O error, failed to receive a PDU
iSCSI: session f7994000 retrying login to portal 1 at 242242
iSCSI: session f7994000 to iqn.1992-08.com.netapp:sn.66666666.rolex1
waiting 1 seconds before next login attempt
iSCSI: bus 0 target 2 trying to establish session f7994000 to portal
1, address 10.60.155.225 port 3260 group 1001
iSCSI: session f7994000 recv_login_pdu: connection closed
iSCSI: session f7994000 login I/O error, failed to receive a PDU
iSCSI: session f7994000 retrying login to portal 2 at 242969
iSCSI: session f7994000 to iqn.1992-08.com.netapp:sn.66666666.rolex1
waiting 1 seconds before next login attempt
iSCSI: bus 0 target 2 trying to establish session f7994000 to portal
2, address 10.60.155.226 port 3260 group 1001
Neighbour table overflow.
iSCSI: session f7994000 to iqn.1992-08.com.netapp:sn.66666666.rolex1
failed to connect, rc -105
iSCSI: session f7994000 connect failed at 243076
iSCSI: session f7994000 iSCSI: session f7994000 retrying all the
portals again, since the portal list got exhausted
iSCSI: session f7994000 to iqn.1992-08.com.netapp:sn.66666666.rolex1
waiting 1 seconds before next login attempt
iSCSI: bus 0 target 2 trying to establish session f7994000 to portal
0, address 10.60.155.31 port 3260 group 1001


Comment 18 Dave Wysochanski 2005-01-30 02:42:32 UTC
One more thing I should have said before.

The first messages I see on the console are not always the "10 second
timeout expired" messages.  Sometimes I initially see the following,
indicating that iscsi is having difficulty getting memory:

tso login: iSCSI: session f7970000 to (2 0 3 *) couldn't allocate task
at 50593, #alloc'd tasks=11437
iSCSI: session f7970000 to (2 0 3 *) couldn't allocate task at 50595,
#alloc'd tasks=11437
iSCSI: session f7970000 to (2 0 3 *) done using preallocated task
f6088f74 at 50597
iSCSI: session f7970000 to (2 0 3 *) couldn't allocate task at 50605,
#alloc'd tasks=11484
iSCSI: session f7970000 to (2 0 3 *) couldn't allocate task at 50607,
#alloc'd tasks=11484


I think the 10-second timeout message is the result of the networking
layer not getting memory, and thus no progress being made with iscsi.
The above messages indicate iscsi task allocation failures.  So I
think the different messages just show who sees the memory pressure
first - the root of the problem is that memory consumption is out of
whack.





Comment 19 Dave Wysochanski 2005-01-30 02:51:32 UTC
I'm also running the same test on rhas2.1, mem=1G, and it seems to be
running fine.  It runs a lot smoother, and I don't see the periodic
freezes either.  I'm going to let the test run overnight just to be
sure, but so far so good (the test has been running for about 40
minutes now).

It's looking to me like it's definitely the vm in rhel3 that's causing
the problem.

It seems like iSCSI software wouldn't be the only thing affected by
this vm change in rhel3.  Wouldn't NFS suffer the same fate?  Are there
reports of lockups with NFS on rhel3?

Comment 20 Dave Wysochanski 2005-01-30 03:10:31 UTC
Just so you know, I am exhausting every other option I can think of
to work around this problem in either iscsi or the networking layer.
But so far no luck.

Comment 21 Dave Wysochanski 2005-02-03 19:19:19 UTC
Since this seems related to bounce pages, one possible way to
alleviate this is to backport the sendpage code from the 2.6-based
iscsi driver into the 2.4 one.

Since RHEL3 implements kmap_noblock(), the "multiple kmap() calls can
block" problem essentially disappears.

This is just an unproven idea at this point.
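
For illustration, the sendpage-style transmit being suggested looks
roughly like this (a hedged sketch; iscsi_sendpage_sketch() is a
hypothetical wrapper, not code from either driver):

#include <linux/mm.h>
#include <linux/net.h>

static ssize_t iscsi_sendpage_sketch(struct socket *sock, struct page *page,
                                     int offset, size_t len)
{
        /* sock->ops->sendpage() hands the page -- highmem or not --
           straight to the network stack, which kmaps it only where it
           must, so no lowmem bounce page is allocated for the payload */
        return sock->ops->sendpage(sock, page, offset, len, 0);
}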

Comment 22 Tom Coughlan 2005-03-08 15:21:52 UTC
*** Bug 150480 has been marked as a duplicate of this bug. ***

Comment 23 Tom Coughlan 2005-03-08 15:42:36 UTC
We have been testing several changes to the VM, but so far they have only
resulted in minor improvements. This will not be fixed in time for U5.

Regarding the suggestion in #21 about removing the need for bounce buffers, I
agree that is a very desirable approach. As a caution, though, I recall this
statement from the iSCSI mailing list some time ago:

"When an HBA driver declares highmem_io, the midlayer assumes it is a driver
for a PCI device and references a PCI structure associated with the HBA.  But
in the iSCSI case, there is in fact no PCI device and the midlayer follows a
wild pointer looking for the PCI structure."

So, some work in the SCSI midlayer may be needed to make this possible.


Comment 24 Mike Christie 2005-06-09 20:01:46 UTC
Bryan's pci highmem comment is correct.

Comment 25 Dave Wysochanski 2005-07-06 18:30:47 UTC
We can repro this now with different memory setups if we use more LUNs.

I'll attach more info.

Comment 26 Dave Wysochanski 2005-07-06 19:39:26 UTC
More specifically, we've seen a failure on an ia-64 machine (9GB RAM) when
we run the 'dt' open source tool against 15-20 luns or so.  Mike has given
us some modified iscsi driver code that sets the dma mask, to see if we can
avoid bounce buffers on ia-64 and this subsequent hang.

I'll zip up the tarball with the various failure info, but here are some
basic details of the test from our engineer:

Data Test usage info:
[root@ibmx450-rtp02 root]# dt
DataTest V13.2 (1/22/00); SQE Build 23

dt was run using multiple 1GB luns (with partitions and ext3 filesystems of
roughly the same size), and commands such as this one:
dt of=/usr/tmp/nate_mnts/mnt_12398_ext3_0/nate_test limit=100m bs=128k procs=8 oncerr=abort disable=pstats errors=1 runtime=1h

After multiple runs, I found that the ia-64 froze at least 50% of the time
when I used 20 or more luns, and it still seemed to do so when I brought the
lun count down to between 14 and 16.  When I got down to 12 or 13 luns, the
machine would typically stay up, but TCP/IP packets would start getting
dropped consistently when the lun count was right around 13.



Comment 27 Dave Wysochanski 2005-07-06 19:53:39 UTC
Created attachment 116429 [details]
system info describing the hangs on i386 and ia-64 with 20+ luns dt test

Comment 28 Dave Wysochanski 2005-07-06 20:28:11 UTC
The attachment I just created shows hangs we've seen when running rhel3 u5 on
both i386 and ia-64 with larger memory setups, using a 'dt' test with 20 or
more luns, as described above.

Will soon attach the exact dt binaries used.

Comment 29 Dave Wysochanski 2005-07-06 20:29:13 UTC
Created attachment 116431 [details]
i386 dt version

Comment 30 Dave Wysochanski 2005-07-06 20:29:59 UTC
Created attachment 116432 [details]
ia-64 dt binary

Comment 31 Mike Christie 2005-07-07 21:22:28 UTC
There is a second issue on ia64, where it is using bounce buffers and using
more memory than it should or needs to. I think I have a hack for that and
opened a separate bugzilla here:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=162706.

Comment 33 Dave Wysochanski 2005-08-18 14:25:53 UTC
Any progress on this for update7 on the other platforms (i386 in particular)?

Thanks.

Comment 34 Ernie Petrides 2005-08-31 19:36:36 UTC
Removing block on bug 167215, which is against RHEL4.

Comment 35 Tom Coughlan 2005-09-16 13:18:06 UTC
There was some discussion of this problem in a different forum, where two
questions were raised: to what extent is this problem limited to x86, and
can we expect to see an improvement going from RHEL 3 to RHEL 4?  The
following answers were provided by Larry Woodman. I'm copying them here to
make the information available to a broader audience.

Summary: in all practical situations the memory deadlock issue is limited to
x86 (barring bugs on the other architectures, such as comment 31). The
situation with x86 should improve in RHEL 4.

----------------

This problem is almost certainly limited to x86 because of the way that the
kernel divides RAM into Lowmem and Highmem.  The kernel divides RAM into two
zones: Lowmem, the first 896MB, and Highmem, everything above 896MB.  Since
Lowmem is permanently mapped into the kernel's virtual address space and
Highmem is not, all kernel data structures and DMA buffers that must be
accessible via the device driver ISR must be allocated from Lowmem.  Highmem,
which in contrast is used to cache file system data (pagecache and buffermem)
and to map the anonymous virtual regions of user address spaces, must be
temporarily mapped a page at a time into the kernel's virtual address space
before it can be accessed by a device driver.

When a subsystem like iscsi is used, the data buffers must be allocated out
of Lowmem so that they are: 1.) DMA-able, so the NIC can DMA into these
buffers, and 2.) mapped into the kernel's virtual address space, so that the
device driver's ISR can look at the data packet.  Furthermore, when an x86
system exhausts memory and kswapd must reclaim those pages of memory, the
Highmem pages must be copied, or bounced, to Lowmem pages before they can be
passed down to the network layer.  This bouncing from Highmem to Lowmem, in
conjunction with the need to allocate all kernel data structures from Lowmem
and the relatively small, limited Lowmem size, applies much more pressure on
the Lowmem region of RAM than on the Highmem region.  This pressure can
totally exhaust Lowmem long before Highmem is exhausted on an x86 system and
therefore lead to deadlocks.
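
Worked out for the 1GB machine in this report (the figures match the
sysrq-m output in comment 17):

    total RAM = 262144 pages x 4KB = 1024MB
    Highmem   =  32768 pages x 4KB =  128MB
    Lowmem    = 262144 - 32768 = 229376 pages = 896MB

So the pagecache is free to dirty nearly the full 1GB while every
sk_buff, bounce buffer, and other kernel data structure competes for
the same 896MB.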

Since the 64-bit architectures have a very large kernel virtual address
space, the kernel can map all of RAM into one virtually contiguous kernel
address space and therefore totally eliminate the need to divide memory into
Lowmem and Highmem.  Since all RAM is mapped into the kernel address space,
kernel data structures and DMA buffers that must be accessible via the
device driver ISR can be allocated from any page of RAM.  This prevents any
uneven memory pressure on 64-bit systems and therefore eliminates the
possibility of deadlocking the system when using iscsi.

One of the key differences between the RHEL3/2.4 kernel and the RHEL4/2.6
kernel is the failure paths in the pageout code.  In the 2.4 kernel, once
the pageout code commits a page to be written out to disk, the write must
go: there are no failure return codes, so the actual device routine must
succeed.  This might mean spinning in a tight loop waiting for some
resource, such as a buffer or some other piece of memory, to become
available.  This is in contrast to the 2.6 kernel, where the pageout code
has actual failure return codes.  If for some reason the pageout cannot
take place properly, the page is re-marked as dirty and placed back on the
appropriate list.  This fundamental difference makes the likelihood of
deadlocking in the VM code much lower in RHEL4 than in RHEL3.
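
To make that difference concrete, a hedged sketch of the 2.6-style path
(heavily simplified, not the verbatim RHEL4 source):

#include <linux/fs.h>
#include <linux/writeback.h>

static void pageout_sketch(struct address_space *mapping, struct page *page,
                           struct writeback_control *wbc)
{
        /* in 2.6 the write is allowed to fail: the VM re-marks the page
           dirty and goes looking for a different page to reclaim */
        if (mapping->a_ops->writepage(page, wbc) != 0)
                redirty_page_for_writepage(wbc, page);
}

The 2.4 launder_page()/sync_page_buffers() path seen in the traces above
has no equivalent failure return; once the pageout is committed it must
block (e.g. spinning in alloc_bounce_page()) until the write can be issued.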

