Description of problem:

Memory deadlock when creating a large file on a system with 1GB of RAM. The 1GB of RAM does seem to be significant. If I boot the same machine with "mem=512M", the problem does not seem to occur. The problem also does not seem to occur on 2GB or 4GB systems, at least from what we can tell. That may just mean it's less reproducible on those systems, but we've only ever seen it on 1GB i386 systems.

Version-Release number of selected component (if applicable):
linux-iscsi-3.6.2

How reproducible:
Happens every time, within a minute or so.

Steps to Reproduce:
1. Create a single 10GB LUN
2. Map it to the Linux box and start iscsi
3. Create one partition that covers the whole disk using fdisk
4. Run "mkfs -t ext3" on the partition created in step 3, and mount the LUN at /iscsi_mnt/large_luns/10-1/
5. Run the following 'dd' command, which should create a single 5GB file on the filesystem:
   dd if=/dev/zero of=/iscsi_mnt/large_luns/10-1/file bs=1024 count=50000000

Actual results:
Machine lockup as a result of an OOM condition, with messages similar to the following on the console, indicating an ENOBUFS error from a socket creation call:

iSCSI: session f79e0000 failed to create socket, rc -105
iSCSI: session f79e0000 connect failed at 259549
iSCSI: session f79e0000 to iqn.1992-08.com.netapp:sn.50393227.yanni waiting 1 seconds before next login attempt
iSCSI: bus 0 target 1 trying to establish session f79e0000 to portal 1, address 10.60.155.93 port 3260 group 1
iSCSI: session f79e0000 failed to create socket, rc -105
iSCSI: session f79e0000 connect failed at 259655

Expected results:
Test runs to completion, without any lockups.

Additional info:
I'm not sure if this is specific to this RedHat release, or is a broader problem in the linux-iscsi-3.6.2 code. I have not filed a bug with sourceforge, but can if we can repro this problem on another system with a different kernel (maybe RHAS 2.1).
I'm also not sure if this is an i386-specific problem, but so far this is the only arch we've run on.
Created attachment 110079 [details]
Output from sysrq-m and sysrq-t for the hang

I can probably get crashdump info if you need it; I don't have that set up currently. Let me know if you want more info. This is really easy to reproduce, though, so you can probably do it on your end easily.
Created attachment 110080 [details]
Messages file from machine that failed with deadlock

This file contains all the messages from today, which includes a few of these failures. For the latest failure, the one I reported this bug against, start from the latest log messages and work backwards.
I just tried this on RHAS 2.1 with 2.4.9-e.34smp and linux-iscsi-3.6.2 stock from sourceforge, and it ran fine. There was one typo in the original dd command (one too many '0's). The correct command should be:

dd if=/dev/zero of=/iscsi_mnt/large_luns/10-1/file bs=1024 count=5000000
Guys,

Is it possible that the 'dd' command above dirties so many pages that the networking stack can't make forward progress? Is there any way to throttle the buffer cache's use of pages, e.g. limit the number of pages that can be dirtied at one time? Maybe I'm way off here. I'm just not sure how to proceed, and this looks like a pretty serious problem that's easily reproducible with RHEL3 U4's linux-iscsi.
It's really odd to me that this problem doesn't occur when I boot with "mem=512M". When I look at the free pages via sysrq-m, they seem to be about the same in the 512M case as in the 1G case (around 4K pages free), but in the 1G case the machine is frozen, unable to make progress, while in the 512M case the machine is fine. Note that the iscsi driver does not set "highmem_io", so it does not call kmap to use highmem pages.
I just ran the Update2 kernel, and I don't see the deadlock. I ran the test multiple times. At one point I saw a few messages on the console indicating iscsi was seeing memory pressure, and the host was a bit unresponsive, but it was nothing like the Update4 kernel behavior. The Update4 kernel deadlocks every time I run the test, within a minute or so. I compared the settings in /proc/sys/vm, and see that the pagecache entry was changed from update2 to update4 as follows:

- update2: 1 15 100
- update4: 1 15 30

I tried changing the update4 values back to the update2 values and re-running the test, but the change didn't seem to make much of a difference. Is there a more fundamental VM difference between update2 and update4?
I have been attempting to reproduce this on U4. I believe that I did see exactly the same problem once. On my other attempts I saw system pauses and messages about connections lost/regained, but no lockup. I will vary a few things and see if I can get it to reproduce more readily.

The sysrq-m data shows that there is some memory available that probably should have been allocated for the iSCSI request in this situation. I am working with the VM experts to figure out what is going on.

There is not a way to throttle the buffer cache's use of pages. As I understand it, such schemes cause more harm than good, as long as the VM is working correctly.

Thanks for checking U2. We will look at the U2/U4 diffs for hints.

Tom
It seems like there would be a fundamental problem with iSCSI if there's no way to throttle the page cache from using memory. What guarantees the networking stack can get the memory it needs to make forward progress? What prevents the following scenario?

Lots of pages have been dirtied by something like the dd process writing to one or more files. IO is given to the driver, and when it reaches the TCP layer, the memory pressure causes an allocation for sk_buff memory to fail (because presumably the allocation is GFP_ATOMIC -- I think this is controlled via the sk->allocation flag) and bdflush to be woken up to flush pages. This in turn issues more IO to the iSCSI driver, which is then blocked.

I'm not sure this is precisely what is happening, and maybe my ignorance of how the whole system works is causing me to ask this question. I do know, though, that the memory taken up by the page cache constantly increases until it reaches very close to 1GB, and then the system freezes.

I also have a theory about why this problem is so reproducible in a 1GB configuration but isn't in a 512M configuration. When the system is locked up, the sysrq-t output often shows at least a few of the threads trying to allocate a bounce_page (I presume because the iscsi driver does not set highmem_io in its template when it registers with the scsi layer, and thus does not use high memory). This doesn't explain why we haven't seen it on a larger memory system, but perhaps it's just harder to reproduce there (maybe we need a larger LUN, a longer dd sequence, etc).
Created attachment 110231 [details]
sysrq-t and sysrq-m for most common machine state

Attachment shows the most common state I see the linux box in when this dd test fails, for what it's worth.
Ok, I found something that works, at least for this testcase. The following settings in /proc/sys/vm/bdflush seem to cure this livelock/deadlock: "30 500 0 0 100 300 50 30"

The default settings are found in fs/buffer.c:

union bdflush_param {
        struct {
                int nfract;     /* Percentage of buffer cache dirty to activate bdflush */
                int ndirty;     /* Maximum number of dirty blocks to write out per wake-cycle */
                int dummy2;     /* old "nrefill" */
                int dummy3;     /* unused */
                int interval;   /* jiffies delay between kupdate flushes */
                int age_buffer; /* Time for normal buffer to age before we flush it */
                int nfract_sync;/* Percentage of buffer cache dirty to activate bdflush synchronously */
                int nfract_stop_bdflush; /* Percentage of buffer cache dirty to stop bdflush */
                int dummy5;     /* unused */
        } b_un;
        unsigned int data[N_PARAM];
} bdf_prm = {{50, 500, 0, 0, 5*HZ, 30*HZ, 80, 50, 0}};

I'm not sure if there's one setting in particular that is more significant than the others, if these settings have other implications, or if this just makes the problem less probable (most likely the case). I was basically just playing around, trying to come up with something that works as I read more about how all this works. My basic assumption was that the root cause of this problem was too high a percentage of dirty pages in the page cache at any given point in time, increasing the possibility that iscsi would need to allocate a page that the VM system would choose to take from one of the dirty iscsi page cache pages.
It's starting to look to me like kswapd is the problem, though admittedly, this is all still pretty new to me. I'm currently running rhel3, update3 in a loop of the dd commands as follows:

while true; do dd if=/dev/zero of=/iscsi_mnt/large_luns/10-1/file bs=1024 count=5000000; done

What I notice is that although the system does not totally hang, like in the rhel3, update4 case, there are times when it freezes and nothing else runs. At one of these times, I captured a sysrq-t trace, which indicates kswapd is the only thread running, and the trace is precisely the same trace as I see during most of the rhel3, update4 hangs I've encountered. The trace is below.

It looks to me like it's possible for kswapd to block waiting for memory while it's trying to free memory, especially if it needs a bounce page. There are the emergency reserve pages that it can use, but perhaps these somehow get completely depleted (by ???) so that kswapd, too, can no longer make forward progress.

kswapd        R 00000001  3768    11     1    13    12 (L-TLB)
Call Trace:
[<c011d570>] smp_apic_timer_interrupt [kernel] 0x70 (0xc38b5de0)
[<c011d500>] smp_apic_timer_interrupt [kernel] 0x0 (0xc38b5df8)
[<c0125154>] sys_sched_yield [kernel] 0xa4 (0xc38b5e34)
[<c0160974>] alloc_bounce_page [kernel] 0x94 (0xc38b5e4c)
[<c0160aac>] create_bounce [kernel] 0x5c (0xc38b5e58)
[<c01cfd48>] locate_hd_struct [kernel] 0x38 (0xc38b5e5c)
[<c01d01f9>] __make_request [kernel] 0x99 (0xc38b5e9c)
[<c0226969>] __kfree_skb [kernel] 0x139 (0xc38b5eb8)
[<c01d093a>] generic_make_request [kernel] 0xea (0xc38b5ef8)
[<c01d09d9>] submit_bh_rsector [kernel] 0x49 (0xc38b5f20)
[<c0168a9f>] sync_page_buffers [kernel] 0x8f (0xc38b5f3c)
[<c0168bf2>] try_to_free_buffers [kernel] 0x142 (0xc38b5f50)
[<c01548f3>] launder_page [kernel] 0x873 (0xc38b5f68)
[<c0155edb>] rebalance_dirty_zone [kernel] 0x9b (0xc38b5f8c)
[<c01563d9>] do_try_to_free_pages_kswapd [kernel] 0x1b9 (0xc38b5fac)
[<c0156508>] kswapd [kernel] 0x68 (0xc38b5fd0)
[<c01564a0>] kswapd [kernel] 0x0 (0xc38b5fe4)
[<c01095ad>] kernel_thread_helper [kernel] 0x5 (0xc38b5ff0)
Looks like I spoke too soon about Update2. Update2 looks like it has a similar problem, though it's much harder to actually lock up. I see similar backtraces with kswapd seemingly hung trying to free pages, waiting for a page. The latest info I have is that at the time of the hang, nr_emergency_pages is 0, and staying at zero. It starts at 32 when the system is running normally, then goes to 0 and stays there during the hung machine state. Somehow it sometimes gets out of this state (not sure how). Next on the list is to try and figure out who is consuming the emergency pages, and how the deadlock/livelock is actually broken. Any info on your end?
The reason nr_emergency_pages is 0 and kswapd is blocked is that there are no free pages left in ZONE_NORMAL. Since there are no emergency pages and no ZONE_NORMAL pages, kswapd can't allocate a bounce buffer to flush the bh on the page and free memory. It looks to me like the counts in the zone structure aren't being updated correctly, because even after multiple 'sync' operations, and even after the 'dd' process has exited, the inactive_dirty_pages and inactive_laundry_pages counts are the highest of the counts, with the free count near or equal to zero. I could maybe understand inactive_laundry_pages being high, but inactive_dirty_pages? Only after I unmount the iscsi device does the free count go back up, and inactive_dirty_pages goes back to a small number or zero. (NOTE: I'm still not really sure what all these lists mean, but this looks wrong, based on a common-sense understanding of "inactive" and "dirty".)
kswapd is looping in alloc_bounce_page(). First it tries to alloc_page() a lowmem bounce buffer, and that fails because the free and inactive_clean lists are empty. Next, alloc_bounce_page() loops until there is an emergency bounce page on the list. As soon as an IO completes that had a bounce buffer, it is placed on the emergency list and kswapd grabs it and exits the loop in alloc_bounce_page(). While kswapd is stuck in this loop, any other callers of alloc_page() end up waiting in wakeup_kswapd because the free list is so low. Also, since everything is funneled behind kswapd (i.e. waiting in wakeup_kswapd), the IOs without bounce pages that have completed while kswapd is stuck in the alloc_bounce_page() loop stay on the inactive_laundry_list. It's kswapd that will post-process the IO completions and move those pages from the inactive_laundry_list to the inactive_clean_list and eventually the free_list.

So, I know why this is happening: kswapd and bdflushd are firing asynchronous writes out so much faster than they are completing that it eventually drains the free_list, inactive_clean_list, and emergency list for Normal-zone pagecache pages and bounce buffers for highmem pagecache pages. The question is why this only happens when iscsi is used and not when local disks are used. Is it because iscsi can't keep up with kswapd and bdflushd where local disks can?

I think I know how to fix this problem when it gets into this situation: change the loop in alloc_bounce_page() to call alloc_page(GFP_DMA) after the yield() call if kswapd is the caller, and have kswapd back off on writepage calls when the system is choked in this state. That will allow only kswapd to use DMA pages to get out of this state, and prevent it from getting back into this state and eventually exhausting the DMA zone. However, before I do that I really need to know why we see 10000 or more bounce buffers outstanding for significant periods of time. And, BTW, the number of page writes outstanding is MUCH larger than 10000. Those bounce buffers only represent highmem pagecache pages; there are ~240000 lowmem pages that don't even need bounce buffers! In a nutshell, I think we are seeing IO completion happen much slower when iscsi is in use than when local disks are in use.

Larry Woodman
I'm not sure it's really IO completing slower that's the problem, but perhaps IO not completing at all. For instance, in the traces I see where the system is completely deadlocked, the iscsi-tx thread is blocked in the tcp layer waiting for memory, presumably because memory is low, having been taken up by the pagecache dirty buffers.

Could this be one possible scenario? First, iscsi-tx blocks in TCP waiting for memory to send IO. (The reason for this is unclear to me, but I'd guess if there's no limit to the memory the page cache can consume, it's possible for that to be the reason.) Next, kswapd is woken up and tries to free memory. But it keeps picking pages that need to go through the iscsi device, which is blocked in TCP, waiting for memory. Eventually kswapd and bdflushd get blocked as well.

iscsi-tx      S 00000001  3356  2583     1   2584  2582 (L-TLB)
Call Trace:
[<c0123f14>] schedule [kernel] 0x2f4 (0xf6ed3a84)
[<c0134fbc>] schedule_timeout [kernel] 0xbc (0xf6ed3ac8)
[<c0252003>] wait_for_tcp_memory [kernel] 0x1d3 (0xf6ed3b00)
[<c0227624>] alloc_skb [kernel] 0xc4 (0xf6ed3b48)
[<c0253c16>] tcp_sendmsg [kernel] 0xdb6 (0xf6ed3b60)
[<c0276692>] inet_sendmsg [kernel] 0x42 (0xf6ed3bd0)
[<c02237f8>] sock_sendmsg [kernel] 0x78 (0xf6ed3be4)
[<f8aee581>] iscsi_sendmsg [iscsi_sfnet] 0x61 (0xf6ed3c28)
[<f8af2cf5>] iscsi_xmit_task [iscsi_sfnet] 0x255 (0xf6ed3c40)
[<c024da3f>] ip_finish_output2 [kernel] 0xcf (0xf6ed3c60)
[<c02613cd>] tcp_send_probe0 [kernel] 0x1dd (0xf6ed3d74)
[<c025f783>] tcp_write_xmit [kernel] 0x1f3 (0xf6ed3d98)
[<c025333f>] tcp_sendmsg [kernel] 0x4df (0xf6ed3dc0)
[<c0276692>] inet_sendmsg [kernel] 0x42 (0xf6ed3dd4)
[<c02237f8>] sock_sendmsg [kernel] 0x78 (0xf6ed3de8)
[<c0276692>] inet_sendmsg [kernel] 0x42 (0xf6ed3e30)
[<c02237f8>] sock_sendmsg [kernel] 0x78 (0xf6ed3e44)
[<f8aee581>] iscsi_sendmsg [iscsi_sfnet] 0x61 (0xf6ed3e88)
[<f8af4c3d>] iscsi_xmit_ping [iscsi_sfnet] 0x15d (0xf6ed3ea0)
[<f8af378b>] iscsi_xmit_queued_cmnds [iscsi_sfnet] 0x20b (0xf6ed3f20)
[<f8aec870>] iscsi_task_times_out [iscsi_sfnet] 0x0 (0xf6ed3f28)
Dave,

The first message in the log always seems to be:

iSCSI: 10 second timeout expired for session ...
iSCSI: session f5ef4000 xmit_data failed to send ...
iSCSI: session f5ef4000 to iqn.1992-08.com.netapp:sn.33610450 dropped
iSCSI: bus 0 target 0 trying to establish session ...

I wonder if the iSCSI infinite command timeout is the cause of the hang. Outstanding commands never time out and return resources like they do with all other I/O devices. My system gets pauses and errors but (almost) always recovers and finishes the dd job, so I can't test this easily. Can you try setting a command timeout of 60 seconds (to match normal SCSI), to see if it gets you out of the hang?

Tom
Thanks for the suggestion Tom. However, I don't think the infinite iscsi timeout is the cause of the hang. At this point, I'm pretty sure it's the page cache's ability to consume memory to the point that it stops iscsi from making forward progress in the networking layer. I think this is the cause of the "10 second timeout" message, but I have yet to verify that explicitly. I set DiskCommandTimeout=60 in /etc/iscsi.conf, and did see a few IO failure messages. However, the system still deadlocks, pretty much as fast as before.

telnet> send break
SysRq : Show Memory
Mem-info:
Zone:DMA freepages: 2905 min: 0 low: 0 high: 0
Zone:Normal freepages: 66973 min: 767 low: 4032 high: 5792
Zone:HighMem freepages: 507 min: 255 low: 512 high: 768
Free pages: 70385 ( 507 HighMem)
( Active: 8437/145829, inactive_laundry: 19032, inactive_clean: 98, free: 70385 )
aa:0 ac:0 id:0 il:0 ic:0 fr:2905
aa:284 ac:2087 id:122376 il:17337 ic:34 fr:66973
aa:1833 ac:4236 id:23453 il:1695 ic:64 fr:507
1*4kB 2*8kB 3*16kB 1*32kB 2*64kB 1*128kB 2*256kB 1*512kB 0*1024kB 1*2048kB 2*4096kB = 11620kB)
1*4kB 0*8kB 1*16kB 1*32kB 1*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 0*2048kB 65*4096kB = 267892kB)
1*4kB 1*8kB 0*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 2028kB)
Swap cache: add 0, delete 0, find 0/0, race 0+0
9934 pages of slabcache
130 pages of kernel stacks
0 lowmem pagetables, 181 highmem pagetables
Free swap: 2104504kB
262144 pages of RAM
32768 pages of HIGHMEM
5001 reserved pages
172636 pages shared
0 pages swap cached

Red Hat Enterprise Linux AS release 3 (Taroon Update 4)
Kernel 2.4.21-15.ELsmp on an i686

tso login:
telnet> send break
SysRq : Show Memory
Mem-info:
Zone:DMA freepages: 2905 min: 0 low: 0 high: 0
Zone:Normal freepages: 788 min: 767 low: 4032 high: 5792
Zone:HighMem freepages: 291 min: 255 low: 512 high: 768
Free pages: 4030 ( 304 HighMem)
( Active: 9390/178272, inactive_laundry: 49896, inactive_clean: 3442, free: 4019 )
aa:0 ac:0 id:0 il:0 ic:0 fr:2905
aa:299 ac:2605 id:159024 il:44346 ic:3297 fr:798
aa:2244 ac:4246 id:19228 il:5497 ic:266 fr:264
1*4kB 2*8kB 3*16kB 1*32kB 2*64kB 1*128kB 2*256kB 1*512kB 0*1024kB 1*2048kB 2*4096kB = 11620kB)
1*4kB 1*8kB 0*16kB 2*32kB 3*64kB 7*128kB 4*256kB 2*512kB 0*1024kB 0*2048kB 0*4096kB = 3212kB)
12*4kB 11*8kB 6*16kB 2*32kB 1*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1256kB)
Swap cache: add 0, delete 0, find 0/0, race 0+0
8851 pages of slabcache
142 pages of kernel stacks
0 lowmem pagetables, 211 highmem pagetables
Free swap: 2104504kB
262144 pages of RAM
32768 pages of HIGHMEM
5001 reserved pages
190232 pages shared
0 pages swap cached

iSCSI: 10 second timeout expired for session f7994000, rx 232994, ping 233493, now 233994
iSCSI: session f7994000 xmit_data failed to send 65584 bytes, rc 42152
iSCSI: session f7994000 to iqn.1992-08.com.netapp:sn.66666666.rolex1 dropped
iSCSI: bus 0 target 2 trying to establish session f7994000 to portal 0, address 10.60.155.31 port 3260 group 1001
iSCSI: login phase for session f7994000 (rx 1803, tx 1802) timed out at 235500, timeout was set for 235500
iSCSI: session f7994000 connect timed out at 235502
iSCSI: session f7994000 to iqn.1992-08.com.netapp:sn.66666666.rolex1 waiting 1 seconds before next login attempt
iSCSI: bus 0 target 2 trying to establish session f7994000 to portal 0, address 10.60.155.31 port 3260 group 1001
iSCSI: login phase for session f7994000 (rx 1803, tx 1802) timed out at 237107, timeout was set for 237107
iSCSI: session f7994000 connect timed out at 237109
iSCSI: session f7994000 to iqn.1992-08.com.netapp:sn.66666666.rolex1 waiting 1 seconds before next login attempt
iSCSI: bus 0 target 2 trying to establish session f7994000 to portal 1, address 10.60.155.225 port 3260 group 1001
iSCSI: login phase for session f7994000 (rx 1803, tx 1802) timed out at 238714, timeout was set for 238714
iSCSI: session f7994000 connect timed out at 238716
iSCSI: session f7994000 to iqn.1992-08.com.netapp:sn.66666666.rolex1 waiting 1 seconds before next login attempt
iSCSI: bus 0 target 2 trying to establish session f7994000 to portal 2, address 10.60.155.226 port 3260 group 1001
Neighbour table overflow.
iSCSI: session f7994000 to iqn.1992-08.com.netapp:sn.66666666.rolex1 failed to connect, rc -105
iSCSI: session f7994000 connect failed at 238823
iSCSI: session f7994000
iSCSI: session f7994000 retrying all the portals again, since the portal list got exhausted
iSCSI: session f7994000 to iqn.1992-08.com.netapp:sn.66666666.rolex1 waiting 1 seconds before next login attempt
iSCSI: session f7994000 failing normal command f7b62a00 cdb 0x2a to (2 0 2 0) at 238903, retries 0, allowed 5
iSCSI: session f7994000 failing normal command f6f03800 cdb 0x2a to (2 0 2 0) at 238905, retries 0, allowed 5
iSCSI: session f7994000 failing normal command f7418200 cdb 0x2a to (2 0 2 0) at 238907, retries 0, allowed 5
iSCSI: session f7994000 failing normal command f7418400 cdb 0x2a to (2 0 2 0) at 238909, retries 0, allowed 5
iSCSI: session f7994000 failing normal command f6e6ec00 cdb 0x2a to (2 0 2 0) at 238911, retries 0, allowed 5
iSCSI: session f7994000 failing normal command f6e6e600 cdb 0x2a to (2 0 2 0) at 238913, retries 0, allowed 5
iSCSI: session f7994000 failing normal command f6e6ee00 cdb 0x2a to (2 0 2 0) at 238915, retries 0, allowed 5
iSCSI: session f7994000 failing normal command f6e6e800 cdb 0x2a to (2 0 2 0) at 238917, retries 0, allowed 5
iSCSI: session f7994000 failing normal command f6f03600 cdb 0x2a to (2 0 2 0) at 238919, retries 0, allowed 5
iSCSI: session f7994000 failing normal command f6e6ea00 cdb 0x2a to (2 0 2 0) at 238922, retries 0, allowed 5
iSCSI: session f7994000 failing normal command f6f03a00 cdb 0x2a to (2 0 2 0) at 238924, retries 0, allowed 5
iSCSI: session f7994000 failing normal command f6f03c00 cdb 0x2a to (2 0 2 0) at 238926, retries 0, allowed 5
iSCSI: bus 0 target 2 trying to establish session f7994000 to portal 0, address 10.60.155.31 port 3260 group 1001
I/O error: dev 08:21, sector 4530984
I/O error: dev 08:21, sector 4531240
I/O error: dev 08:21, sector 4531496
I/O error: dev 08:21, sector 4531752
I/O error: dev 08:21, sector 4532008
I/O error: dev 08:21, sector 4532264
I/O error: dev 08:21, sector 4532520
I/O error: dev 08:21, sector 4532776
I/O error: dev 08:21, sector 4533032
I/O error: dev 08:21, sector 4533280
I/O error: dev 08:21, sector 4533536
I/O error: dev 08:21, sector 4533792
iSCSI: bus 0 target 2 established session f7994000 #2, portal 0, address 10.60.155.31 port 3260 group 1001

telnet> send break
iSCSI: 10 second timeout expired for session f7994000, rx 239006, ping 239505, now 240006
iSCSI: session f7994000 xmit_data failed to send 65584 bytes, rc 61792
iSCSI: session f7994000 to iqn.1992-08.com.netapp:sn.66666666.rolex1 dropped
iSCSI: bus 0 target 2 trying to establish session f7994000 to portal 0, address 10.60.155.31 port 3260 group 1001
SysRq : Show Memory
Mem-info:
Zone:DMA freepages: 2905 min: 0 low: 0 high: 0
Zone:Normal freepages: 0 min: 767 low: 4032 high: 5792
Zone:HighMem freepages: 301 min: 255 low: 512 high: 768
Free pages: 3206 ( 301 HighMem)
( Active: 9456/168537, inactive_laundry: 58481, inactive_clean: 570, free: 3206 )
aa:0 ac:0 id:0 il:0 ic:0 fr:2905
aa:299 ac:2668 id:158222 il:44412 ic:0 fr:0
aa:2244 ac:4245 id:10315 il:14069 ic:570 fr:301
1*4kB 2*8kB 3*16kB 1*32kB 2*64kB 1*128kB 2*256kB 1*512kB 0*1024kB 1*2048kB 2*4096kB = 11620kB)
0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB)
1*4kB 0*8kB 9*16kB 3*32kB 1*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1204kB)
Swap cache: add 0, delete 0, find 0/0, race 0+0
8785 pages of slabcache
142 pages of kernel stacks
0 lowmem pagetables, 211 highmem pagetables
Free swap: 2104504kB
262144 pages of RAM
32768 pages of HIGHMEM
5001 reserved pages
194715 pages shared
0 pages swap cached
iSCSI: login phase for session f7994000 (rx 1803, tx 1802) timed out at 241512, timeout was set for 241512
iSCSI: session f7994000 connect timed out at 241514
iSCSI: session f7994000 to iqn.1992-08.com.netapp:sn.66666666.rolex1 waiting 1 seconds before next login attempt
iSCSI: bus 0 target 2 trying to establish session f7994000 to portal 0, address 10.60.155.31 port 3260 group 1001
iSCSI: session f7994000 recv_login_pdu: connection closed
iSCSI: session f7994000 login I/O error, failed to receive a PDU
iSCSI: session f7994000 retrying login to portal 1 at 242242
iSCSI: session f7994000 to iqn.1992-08.com.netapp:sn.66666666.rolex1 waiting 1 seconds before next login attempt
iSCSI: bus 0 target 2 trying to establish session f7994000 to portal 1, address 10.60.155.225 port 3260 group 1001
iSCSI: session f7994000 recv_login_pdu: connection closed
iSCSI: session f7994000 login I/O error, failed to receive a PDU
iSCSI: session f7994000 retrying login to portal 2 at 242969
iSCSI: session f7994000 to iqn.1992-08.com.netapp:sn.66666666.rolex1 waiting 1 seconds before next login attempt
iSCSI: bus 0 target 2 trying to establish session f7994000 to portal 2, address 10.60.155.226 port 3260 group 1001
Neighbour table overflow.
iSCSI: session f7994000 to iqn.1992-08.com.netapp:sn.66666666.rolex1 failed to connect, rc -105
iSCSI: session f7994000 connect failed at 243076
iSCSI: session f7994000
iSCSI: session f7994000 retrying all the portals again, since the portal list got exhausted
iSCSI: session f7994000 to iqn.1992-08.com.netapp:sn.66666666.rolex1 waiting 1 seconds before next login attempt
iSCSI: bus 0 target 2 trying to establish session f7994000 to portal 0, address 10.60.155.31 port 3260 group 1001
One more thing I should have said before: the first messages I see on the console are not always the "10 second timeout expired" messages. Sometimes I initially see the following, indicating that iscsi is having difficulty getting memory:

tso login: iSCSI: session f7970000 to (2 0 3 *) couldn't allocate task at 50593, #alloc'd tasks=11437
iSCSI: session f7970000 to (2 0 3 *) couldn't allocate task at 50595, #alloc'd tasks=11437
iSCSI: session f7970000 to (2 0 3 *) done using preallocated task f6088f74 at 50597
iSCSI: session f7970000 to (2 0 3 *) couldn't allocate task at 50605, #alloc'd tasks=11484
iSCSI: session f7970000 to (2 0 3 *) couldn't allocate task at 50607, #alloc'd tasks=11484

I think the 10-second timeout message is the result of the networking layer not getting memory, and thus no progress being made with iscsi. The messages above indicate iscsi task allocation failures. So I think the different messages just show who sees the memory pressure first; the root of the problem is that memory consumption is out of whack.
I'm also running the same test on rhas2.1, mem=1G, and it seems to be running fine. It runs a lot smoother; I don't see the periodic freezes either. I'm going to let the test run overnight just to be sure, but so far so good (the test has been running for about 40 minutes now). It's looking to me like it's definitely the VM in rhel3 that's causing the problem. It seems like iSCSI software wouldn't be the only thing affected by this VM change in rhel3. Wouldn't NFS suffer the same fate? Are there reports of lockups with NFS on rhel3?
Just so you know, I am exhausting any other option that I can think of to work around this problem in either iscsi or the networking layer. But so far no luck.
Since this seems related to bounce pages, one possible way to alleviate this is to backport the sendpage code from the 2.6 based iscsi stuff into the 2.4 stuff. Since RHEL3 implements kmap_noblock() the "multiple kmap() can block" problem essentially disappears. This is just an unproven idea at this point.
*** Bug 150480 has been marked as a duplicate of this bug. ***
We have been testing several changes to the VM, but so far they have only resulted in minor improvements. This will not be fixed in time for U5.

Regarding the suggestion in #21 about removing the need for bounce buffers, I agree that is a very desirable approach. As a caution, though, I recall this statement from the iSCSI mailing list some time ago:

"When an HBA driver declares highmem_io, the midlayer assumes it is a driver for a PCI device and references a PCI structure associated with the HBA. But in the iSCSI case, there is in fact no PCI device and the midlayer follows a wild pointer looking for the PCI structure."

So, some work in the SCSI midlayer may be needed to make this possible.
Bryan's pci highmem comment is correct.
We can repro this now with different memory setups if we use more LUNs. I'll attach more info.
More specifically, we've seen a failure on an ia-64 machine (9GB RAM) when we run the 'dt' open source tool against 15-20 luns or so. Mike has given us some modified iscsi driver code that sets the dma mask to see if we can avoid bounce buffers on ia-64, and this subsequent hang. I'll zip up the tarball with the various failure info, but here are some basic details of the test from our engineer:

Data Test usage info:
[root@ibmx450-rtp02 root]# dt
DataTest V13.2 (1/22/00); SQE Build 23

dt was run using multiple 1GB luns (with partitions and ext3 filesystems of roughly the same size), and commands such as this one:

dt of=/usr/tmp/nate_mnts/mnt_12398_ext3_0/nate_test limit=100m bs=128k procs=8 oncerr=abort disable=pstats errors=1 runtime=1h

After multiple runs, I found that the ia-64 froze at least 50% of the time when I used 20 or more luns, and it still seemed to do so when I brought the lun count down to between 14 and 16. When I got down to 12 and 13 luns, the machine would typically stay up, but TCP/IP packets would start getting dropped consistently when the lun count was right around 13.
Created attachment 116429 [details] system info describing the hangs on i386 and ia-64 with 20+ luns dt test
The attachment I just created contains the hangs we've seen when running rhel3 u5 on both i386 and ia-64 with larger memory setups, using a 'dt' test with 20 or more luns, as described above. Will soon attach the exact dt binaries used.
Created attachment 116431 [details] i386 dt version
Created attachment 116432 [details] ia-64 dt binary
There is a second issue on ia64 where it is using bounce buffers and using more memory than it should or needs to. I think I have a hack for that and have filed a separate bugzilla here: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=162706.
Any progress on this for update7 on the other platforms (i386 in particular)? Thanks.
Removing block on bug 167215, which is against RHEL4.
There was some discussion of this problem in a different forum, where two questions were raised: to what extent is this problem limited to x86, and can we expect to see an improvement going from RHEL 3 to RHEL 4? The following answers were provided by Larry Woodman. I'm copying them here to make the information available to a broader audience.

Summary: in all practical situations the memory deadlock issue is limited to x86 (barring bugs on the other architectures, such as comment 31). The situation with x86 should improve in RHEL 4.

----------------

This problem is almost certainly limited to x86 because of the way that the kernel divides RAM into Lowmem and Highmem. The kernel divides RAM into two zones: Lowmem, the first 896MB, and Highmem, everything above 896MB. Since Lowmem is permanently mapped into the kernel's virtual address space and Highmem is not, all kernel data structures and DMA buffers that must be accessible via the device driver ISR must be allocated from Lowmem. Highmem, which in contrast is used to cache file system data (pagecache and buffermem) and to map into the anonymous virtual regions of user address spaces, must be temporarily mapped a page at a time into the kernel's virtual address space before it can be accessed by a device driver.

When a subsystem like iscsi is used, the data buffers must be allocated out of Lowmem so that they are: 1.) DMA-able, so the NIC can DMA into these buffers, and 2.) mapped into the kernel's virtual address space, so that the device driver's ISR can look at the data packet. Furthermore, when an x86 system exhausts memory and kswapd must reclaim those pages of memory, the Highmem pages must be copied, or bounced, to Lowmem pages before they can be passed down to the network layer.
This bouncing from Highmem to Lowmem, in conjunction with the need to allocate all kernel data structures from Lowmem, combined with the relatively small, limited Lowmem size, applies much more pressure on the Lowmem region of RAM than the Highmem region. This pressure can totally exhaust Lowmem long before Highmem is exhausted on an x86 system, and therefore lead to deadlocks.

Since the 64-bit architectures have a very large kernel virtual address space, the kernel can map all of RAM into one virtually contiguous kernel address space, and therefore totally eliminate the need to divide memory into Lowmem and Highmem. Since all RAM is mapped into the kernel address space, kernel data structures and DMA buffers that must be accessible via the device driver ISR can be allocated from any page of RAM. This prevents the uneven memory pressure on 64-bit systems, and therefore eliminates the possibility of deadlocking the system when using iscsi.

One of the key differences between the RHEL3/2.4 kernel and the RHEL4/2.6 kernel is the failure paths in the pageout code. In the 2.4 kernel, once the pageout code determines that it is committing a page to be written out to disk, it must go. There are no failure return codes, so the actual device routine must succeed. This might mean spinning in a tight loop waiting for some resource, such as a buffer or some other piece of memory, to become available. This is in contrast to the 2.6 kernel, where the pageout code has actual failure return codes. If for some reason the pageout cannot or will not take place properly, the page is re-marked as dirty and placed back on the appropriate list. This fundamental difference makes the likelihood of deadlocking in the VM code much lower in RHEL4 than RHEL3.