Bug 205722

| Field | Value |
|---|---|
| Summary: | lockup in shrink_zone when node out of memory |
| Product: | Red Hat Enterprise Linux 4 |
| Reporter: | Issue Tracker <tao> |
| Component: | kernel |
| Assignee: | Larry Woodman <lwoodman> |
| Status: | CLOSED ERRATA |
| QA Contact: | Brian Brock <bbrock> |
| Severity: | high |
| Priority: | urgent |
| Version: | 4.0 |
| CC: | clalance, ddomingo, herbert.van.den.bergh, jbaron, jplans, jrfuller, nobody+bjmason, poelstra, tao |
| Keywords: | OtherQA, ZStream |
| Hardware: | All |
| OS: | Linux |
| Fixed In Version: | RHBA-2007-0791 |
| Doc Type: | Bug Fix |
| Last Closed: | 2007-11-15 16:15:09 UTC |
| Bug Depends On: | 245197 |
| Bug Blocks: | 234251, 238901, 238902, 238904, 238905, 245198, 246621, 248141, 248673, 435662 |
Description
Issue Tracker
2006-09-08 08:30:33 UTC
We are seeing many occurrences of the following lockup when nodes run out of memory without swap.

NMI Watchdog detected LOCKUP, CPU=0, registers:
CPU 0
Modules linked in: perfctr(U) netdump(U) job(U) i2c_dev(U) i2c_core(U) ib_ipoib(U) rdma_ucm(U) rdma_cm(U) ib_addr(U) ib_mthca(U) ib_umad(U) ib_ucm(U) ib_uverbs(U) ib_cm(U) ib_sa(U) ib_mad(U) ib_core(U) dm_mod(U) sd_mod(U) usb_storage(U) joydev(U) rtc(U) md(U) ohci_hcd(U) k8_edac(U) edac_mc(U) floppy(U) sata_nv(U) libata(U) scsi_mod(U) unionfs(U) nfs(U) lockd(U) sunrpc(U) e1000(U)
Pid: 25914, comm: lamp_DD Not tainted 2.6.9-50chaos
RIP: 0010:[<ffffffff802ed635>] <ffffffff802ed635>{.text.lock.spinlock+46}
RSP: 0018:00000103d25a18d8 EFLAGS: 00000086
RAX: 0000000000000000 RBX: 0000010300000800 RCX: 0000010300000808
RDX: 0000010300f245f0 RSI: 000000000000000e RDI: 0000010300000800
RBP: 0000000000000000 R08: 00000103d25a0000 R09: 0000000300000000
R10: 0000000300000000 R11: 0000000000000000 R12: 0000010300000780
R13: 00000103d25a1ab8 R14: 00000103d25a1b38 R15: 00000103d25a1bd8
FS: 0000000040200960(005b) GS:ffffffff804e2080(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a956fb000 CR3: 0000000000101000 CR4: 00000000000006e0
Process lamp_DD (pid: 25914, threadinfo 00000103d25a0000, task 00000101ff647550)
Stack: 0000010300000800 ffffffff80161370 409cf80000000000 0000000000000000
       ffffffe000000020 0000000000000020 0000000000000000 000000000004ea5b
       409cf80000000000 0000000000000246
Call Trace: <ffffffff80161370>{shrink_zone+1456} <ffffffff8012eba5>{move_tasks+406}
   <ffffffff8014732c>{keventd_create_kthread+0} <ffffffff802ec3df>{thread_return+0}
   <ffffffff802ec437>{thread_return+88} <ffffffff80131396>{autoremove_wake_function+0}
   <ffffffff80162047>{try_to_free_pages+318} <ffffffff8015a69b>{__alloc_pages+545}
   <ffffffff8015d3c0>{do_page_cache_readahead+209} <ffffffff80157278>{filemap_nopage+338}
   <ffffffff80165e80>{do_no_page+1045} <ffffffff8016622a>{handle_mm_fault+343}
   <ffffffff8011fb99>{do_page_fault+545} <ffffffff80131396>{autoremove_wake_function+0}
   <ffffffff8018fbc8>{dnotify_parent+34} <ffffffff80175c20>{vfs_read+252}
   <ffffffff8010fd89>{error_exit+0}
Code: 83 3b 00 7e f9 e9 91 fd ff ff f3 90 83 3b 00 7e f9 e9 cf fd
Kernel panic - not syncing: nmi watchdog
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at panic:74
invalid operand: 0000 [1] SMP
CPU 0
Modules linked in: perfctr(U) netdump(U) job(U) i2c_dev(U) i2c_core(U) ib_ipoib(U) rdma_ucm(U) rdma_cm(U) ib_addr(U) ib_mthca(U) ib_umad(U) ib_ucm(U) ib_uverbs(U) ib_cm(U) ib_sa(U) ib_mad(U) ib_core(U) dm_mod(U) sd_mod(U) usb_storage(U) joydev(U) rtc(U) md(U) ohci_hcd(U) k8_edac(U) edac_mc(U) floppy(U) sata_nv(U) libata(U) scsi_mod(U) unionfs(U) nfs(U) lockd(U) sunrpc(U) e1000(U)
Pid: 25914, comm: lamp_DD Not tainted 2.6.9-50chaos
RIP: 0010:[<ffffffff801336f2>] <ffffffff801336f2>{panic+211}
RSP: 0018:ffffffff80471ea8 EFLAGS: 00010082
RAX: 000000000000002c RBX: ffffffff80301543 RCX: 0000000000000046
RDX: 0000000000008378 RSI: 0000000000000046 RDI: ffffffff803b9c60
RBP: ffffffff80472058 R08: 0000000000000007 R09: ffffffff80301543
R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000003d
R13: 00000103d25a1ab8 R14: 00000103d25a1b38 R15: 00000103d25a1bd8
FS: 0000000040200960(005b) GS:ffffffff804e2080(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a956fb000 CR3: 0000000000101000 CR4: 00000000000006e0
Process lamp_DD (pid: 25914, threadinfo 00000103d25a0000, task 00000101ff647550)
Stack:
0000003000000008 ffffffff80471f88 ffffffff80471ec8 0000000000000013
0000000000000000 0000000000000046 000000000000834c 0000000000000046
0000000000000007 ffffffff80303a98
Call Trace: <ffffffff80110858>{show_stack+241} <ffffffff80110982>{show_registers+277}
   <ffffffff80110c89>{die_nmi+130} <ffffffff8011a58d>{nmi_watchdog_tick+210}
   <ffffffff80111560>{default_do_nmi+122} <ffffffff8011a643>{do_nmi+115}
   <ffffffff8011016b>{paranoid_exit+0} <ffffffff802ed635>{.text.lock.spinlock+46}
   <EOE> <ffffffff80161370>{shrink_zone+1456} <ffffffff8012eba5>{move_tasks+406}
   <ffffffff8014732c>{keventd_create_kthread+0} <ffffffff802ec3df>{thread_return+0}
   <ffffffff802ec437>{thread_return+88} <ffffffff80131396>{autoremove_wake_function+0}
   <ffffffff80162047>{try_to_free_pages+318} <ffffffff8015a69b>{__alloc_pages+545}
   <ffffffff8015d3c0>{do_page_cache_readahead+209} <ffffffff80157278>{filemap_nopage+338}
   <ffffffff80165e80>{do_no_page+1045} <ffffffff8016622a>{handle_mm_fault+343}
   <ffffffff8011fb99>{do_page_fault+545} <ffffffff80131396>{autoremove_wake_function+0}
   <ffffffff8018fbc8>{dnotify_parent+34} <ffffffff80175c20>{vfs_read+252}
   <ffffffff8010fd89>{error_exit+0}
Code: 0f 0b 9f 1a 30 80 ff ff ff ff 4a 00 31 ff e8 57 bf fe ff e8
RIP <ffffffff801336f2>{panic+211} RSP <ffffffff80471ea8>

crash> bt
PID: 25914  TASK: 101ff647550  CPU: 0  COMMAND: "lamp_DD"
 #0 [ffffffff80471ce0] netpoll_start_netdump at ffffffffa02213b7
 #1 [ffffffff80471d10] die at ffffffff80110bf8
 #2 [ffffffff80471d30] do_invalid_op at ffffffff80110fc0
 #3 [ffffffff80471d68] panic at ffffffff801336f2
 #4 [ffffffff80471d70] release_console_sem at ffffffff8013401e
 #5 [ffffffff80471d90] vprintk at ffffffff8013424c
 #6 [ffffffff80471dc0] printk at ffffffff801342f6
 #7 [ffffffff80471df0] error_exit at ffffffff8010fd89
    [exception RIP: panic+211]
    RIP: ffffffff801336f2  RSP: ffffffff80471ea8  RFLAGS: 00010082
    RAX: 000000000000002c  RBX: ffffffff80301543  RCX: 0000000000000046
    RDX: 0000000000008378  RSI: 0000000000000046  RDI: ffffffff803b9c60
    RBP: ffffffff80472058  R8:  0000000000000007  R9:  ffffffff80301543
    R10: 0000000000000000  R11: 0000000000000000  R12: 000000000000003d
    R13: 00000103d25a1ab8  R14: 00000103d25a1b38  R15: 00000103d25a1bd8
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #8 [ffffffff80471ea0] panic at ffffffff801336de
 #9 [ffffffff80471f20] show_stack at ffffffff80110858
#10 [ffffffff80471f50] show_registers at ffffffff80110982
#11 [ffffffff80471f80] die_nmi at ffffffff80110c89
#12 [ffffffff80471fa0] nmi_watchdog_tick at ffffffff8011a58d
#13 [ffffffff80471fe0] default_do_nmi at ffffffff80111560
#14 [ffffffff80472040] do_nmi at ffffffff8011a643
    [exception RIP: .text.lock.spinlock+46]
    RIP: ffffffff802ed635  RSP: 00000103d25a18d8  RFLAGS: 00000086
    RAX: 0000000000000000  RBX: 0000010300000800  RCX: 0000010300000808
    RDX: 0000010300f245f0  RSI: 000000000000000e  RDI: 0000010300000800
    RBP: 0000000000000000  R8:  00000103d25a0000  R9:  0000000300000000
    R10: 0000000300000000  R11: 0000000000000000  R12: 0000010300000780
    R13: 00000103d25a1ab8  R14: 00000103d25a1b38  R15: 00000103d25a1bd8
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <exception stack> ---
#15 [103d25a18d8] .text.lock.spinlock at ffffffff802ed635
#16 [103d25a18e0] shrink_zone at ffffffff80161370
#17 [103d25a1940] move_tasks at ffffffff8012eba5
#18 [103d25a19d0] thread_return at ffffffff802ec437
#19 [103d25a1ba0] try_to_free_pages at ffffffff80162047
#20 [103d25a1c50] __alloc_pages at ffffffff8015a69b
#21 [103d25a1cc0] do_page_cache_readahead at ffffffff8015d3c0
#22 [103d25a1d30] filemap_nopage at ffffffff80157278
#23 [103d25a1d90] do_no_page at ffffffff80165e80
#24 [103d25a1df0] handle_mm_fault at ffffffff8016622a
#25 [103d25a1e70] do_page_fault at ffffffff8011fb99
#26 [103d25a1ee0] dnotify_parent at ffffffff8018fbc8
#27 [103d25a1f10] vfs_read at ffffffff80175c20
#28 [103d25a1f50] error_exit at ffffffff8010fd89
    RIP: 0000002a96148d4c  RSP: 00000000401fd1e0  RFLAGS: 00010202
    RAX: 00000000fbad8004  RBX: 0000002a95936570  RCX: 00000000402000b0
    RDX: 0000000040200090  RSI: 00000000401fd3f8  RDI: 0000002a95936570
    RBP: 00000000401fd3f8  R8:  00000000401fd9c0  R9:  0000000000000000
    R10: 0000000040200101  R11: 0000002a96137e50  R12: 0000002a9632b080
    R13: 0000000000000000  R14: 0000000000000000  R15: 00000000401fd8c0
    ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b

crash> kmem -i
              PAGES     TOTAL    PERCENTAGE
 TOTAL MEM  3913698   14.9 GB    ----
      FREE     6998   27.3 MB    0% of TOTAL MEM
      USED  3906700   14.9 GB    99% of TOTAL MEM
    SHARED        0         0    0% of TOTAL MEM
   BUFFERS        0         0    0% of TOTAL MEM
    CACHED    22763   88.9 MB    0% of TOTAL MEM
      SLAB        0         0    0% of TOTAL MEM
TOTAL HIGH        0         0    0% of TOTAL MEM
 FREE HIGH        0         0    0% of TOTAL HIGH
 TOTAL LOW  3913698   14.9 GB    100% of TOTAL MEM
  FREE LOW     6998   27.3 MB    0% of TOTAL LOW
TOTAL SWAP        0         0    ----
 SWAP USED        0         0    100% of TOTAL SWAP
 SWAP FREE        0         0    0% of TOTAL SWAP

This event sent from IssueTracker by nmurray [SEG - Kernel] issue 101232

Sorry, this is on x86_64, 4 socket, dual-core, 16GB RAM. We have lower_zone_protection = 100. I will see about getting you a vmcore.

This event sent from IssueTracker by nmurray [SEG - Kernel] issue 101232

usemem doesn't seem to trigger it. Maybe IB and locking of user-space pages in kernel space must be involved. The bz:193695 patch is applied.

This event sent from IssueTracker by nmurray [SEG - Kernel] issue 101232

10:55 <neb> So usemem-mpi causes the problem reliably?
10:56 <neb> So I can see what you are saying there are no spinlocks in shrink_zone
10:56 <neb> however it does call out to refill_inactive_zone, shrink_cache and throttle_vm_writeout.
10:57 <neb> I don't know why we wouldn't see the one holding the lock in the bt though.
10:57 <grondo> yeah, it must be one of those, not sure why they aren't in the stack trace
10:57 <neb> I suspect that it is NOT throttle_vm_writeout()
10:58 <neb> Inlining?
10:59 <neb> with the usemem code how much memory is pinned?
10:59 <grondo> ah, yeah are those inlined?
10:59 <neb> it doesn't look like it explicitly but who knows what evilness the compiler does in the end.
11:01 <neb> shrink_cache just recursively calls shrink_zone
11:04 <neb> refill_inactive_zone has spinlocks
11:06 <neb> 674 spin_lock_irq(&zone->lru_lock);
11:07 <neb> so the hypothesis is that for whatever allocation is trying to be done (and you have provided 3 examples)
11:09 <neb> The number of active pages are so large that the process of scanning the pages takes longer than the NMI allows.
11:10 <neb> So to the NMI watchdog it looks as though the program is stuck in the while loop started in 625
11:10 <neb> That would also explain the lost timer ticks since it is spin_lock_irq
11:11 <neb> I would say that a reasonable test might be to set the watchdog timer up to a higher value.
11:11 <neb> or to remove RAM from the machine and see if it works.
11:12 <neb> The RAM removal might reduce the time needed to scan all of the pages below the threshold for the watchdog

This event sent from IssueTracker by nmurray [SEG - Kernel] issue 101232

16:08 <grondo> ok, finally running my test with NMI timeout of 60s
16:09 <grondo> well, you were right
16:10 <grondo> OOM killer went off as expected and node eventually recovered, though it was out to lunch for at least 45-90s
16:10 <grondo> So that is a big clue
16:10 <grondo> Thanks for the suggestion
16:12 <neb> cool.
16:12 <neb> Hmm...I'll try to get more info as to what to do about it.

This event sent from IssueTracker by nmurray [SEG - Kernel] issue 101232

Kent, can we get some more serious VM people to weigh in as to what to do about this? I feel as though the implications of various approaches to solving this issue would escape me, and therefore it would be most beneficial to draw on those people in RH who have more experience in how to work around issues in the VM. The problem seems to be that when memory is tight, one or more other CPUs hang on the zone->lru_lock and then the NMI watchdog kicks off. The contributing factors seem to be:
1) no swap
2) heavily SMP machine (8 cores) - the 2nd or 3rd or 6th person waiting on the lock might have to wait a very long time
3) very active memory utilization
4) 16GiB of mem, i.e. 4Mi pages, so the time it takes to scan for active pages takes a while
5) only 5 secs before the NMI watchdog goes off
6) x86_64, i.e. everything is in one zone

There might be something to do with pinned memory and the IB interface, but that could also just be that all that pinned memory makes it difficult to find a page that can be put on the inactive list. A piece of information that doesn't appear in the case log to this point is that there were messages in the logs for the crashed nodes about missing timer interrupts and drivers hogging the CPU with interrupts off. I don't have the exact messages handy. This is consistent with the fact that the lock is taken with interrupts disabled. One data point that doesn't jibe with my hypothesis is the fact that the dumps don't show refill_inactive_zone in their backtraces. Why would that be? The only thing that I could think of was that the compiler decided to inline it even though it wasn't explicitly listed as an inline function.

This event sent from IssueTracker by nmurray [SEG - Kernel] issue 101232

I'm not going to have a lot of information on this other than what's been presented. LLNL has already done most of the research and narrowing down of what they think is going on based on the analysis of their vmcores. At this point, they'd like to get a resident VM expert to have a look and weigh in on it. I don't have a reproducer readily available...this was reproduced in an HPC environment using MPI and infiniband hardware on x86-64 compute nodes with 8 AMD cores and 16GB of RAM each. If there's any additional information that you guys need, let me or Ben know.

Issue escalated to Support Engineering Group by: kbaxley.
Internal Status set to 'Waiting on SEG'
Status set to: Waiting on Tech

This event sent from IssueTracker by nmurray [SEG - Kernel] issue 101232

I was able to reproduce this problem by adding mlockall(MCL_FUTURE); at the beginning of usemem.c. So the pinning of memory seems to be more important than anything else related to IB. Kent, if you'd like I could attach the new usemem.c, but the addition is trivial.

mark

This event sent from IssueTracker by nmurray [SEG - Kernel] issue 101232
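The modified usemem.c was never attached to this bug. For illustration only, a minimal sketch of the reproducer described above (not the actual LLNL usemem.c): the key ingredients are an early mlockall(MCL_FUTURE) followed by dirtying a large anonymous allocation. The size argument and defaults here are assumptions.

```c
/* Hypothetical sketch of the reproducer described above; NOT the
 * actual usemem.c from IT 101232. Pins all future mappings, then
 * dirties more anonymous memory than the (swapless) node can spare. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    size_t mib = (argc > 1) ? strtoul(argv[1], NULL, 0) : 2048; /* MiB, illustrative default */
    size_t len = mib << 20;

    /* The one-line addition called out above: every page mapped from
     * here on is locked and therefore unreclaimable by the VM. */
    if (mlockall(MCL_FUTURE) != 0)
        perror("mlockall");

    char *p = malloc(len);
    if (!p) {
        perror("malloc");
        return 1;
    }
    for (size_t off = 0; off < len; off += 4096)   /* touch every page */
        p[off] = 1;

    printf("locked and dirtied %zu MiB\n", mib);
    pause();                                       /* hold the memory until killed */
    return 0;
}
```

As noted below, the tests ran one copy of usemem per core (eight on a quad-socket, dual-core node).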
I agree it's unlikely to be IB related (unless IB is doing some mlock'ing) - the mlockall is definitely a stress on the VM, as those pages are going to be completely unreclaimable by the VM. There's another IT with an uncannily similar footprint - 99166. Since raising the nmi timeout had positive impact, have they yet tested lowering the amount of memory in the system? Do we have any idea what percentage of memory (or memory per node) is locked when we're having a problem?

This event sent from IssueTracker by nmurray [SEG - Kernel] issue 101232

Hi Mark,

I haven't been able to reproduce it with the modified usemem.c yet. Apparently you need a system with a lot of memory in it to get it to trigger. I tried this on a quad, dual-core Opteron system with 8GB of RAM (which is about the largest system I have access to at this time). The OOM killer launched and the system recovered. I did, however, escalate it to engineering yesterday afternoon. Here's the response I got back from them...it looks like another customer out there (Fujitsu) is hitting something similar (questions and comments below):

We agree that the problem is unlikely to be IB related (unless IB is doing some mlock'ing) - the mlockall is definitely a stress on the vm as those pages are going to be completely unreclaimable by the vm. We've got another customer at Fujitsu with an uncannily similar problem. Since raising the nmi timeout had positive impact, have they yet tested lowering the amount of memory in the system? Does LLNL have any idea what percentage of memory (or memory per node) is locked when they're having a problem?

This event sent from IssueTracker by nmurray [SEG - Kernel] issue 101232

In the mlockall() case, definitely most of the memory on each node is pinned. I forgot to say that in my tests, I ran 1 copy of "usemem" per core on the system, so on a quad socket/dual core node I was running 8 copies of usemem. I don't know if running just one copy will trigger the problem. MPI jobs on Infiniband definitely need to lock large regions of memory. Any memory that might be used for communications cannot be swappable, so the IB drivers or MPI implementation locks these pages. I'm not sure how much of the application's memory is locked in this case, however. At this point, it might be everything that is malloc'd, I'm just not sure. I will try lowering the amount of RAM in the machine. I'm confident that at some point this will relieve the problem.

This event sent from IssueTracker by nmurray [SEG - Kernel] issue 101232

A very similar looking footprint has also been reported by Fujitsu Siemens - though without the use of mlock'd memory. Tentatively linking that IT here as well and wanting to get some conversation going with engineering on this, as the similarity seems to be multiprocessor, multicore systems running memory stress tests. Can someone grab a quick AltSysrq-M output so I can see the size of the memory nodes?

Larry Woodman

Working on getting that for you Larry. However, as a preliminary note, the nodes are 4 socket dual core chips and they have 16GiB RAM spread evenly amongst the nodes. Thus it should be 4GiB/node.
-ben

This event sent from IssueTracker by woodard issue 101232

Mem-info (console log, 2006-09-11 11:08:48):
Node 3 DMA per-cpu: empty
Node 3 Normal per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
cpu 1 hot: low 32, high 96, batch 16
cpu 1 cold: low 0, high 32, batch 16
cpu 2 hot: low 32, high 96, batch 16
cpu 2 cold: low 0, high 32, batch 16
cpu 3 hot: low 32, high 96, batch 16
cpu 3 cold: low 0, high 32, batch 16
cpu 4 hot: low 32, high 96, batch 16
cpu 4 cold: low 0, high 32, batch 16
cpu 5 hot: low 32, high 96, batch 16
cpu 5 cold: low 0, high 32, batch 16
cpu 6 hot: low 32, high 96, batch 16
cpu 6 cold: low 0, high 32, batch 16
cpu 7 hot: low 32, high 96, batch 16
cpu 7 cold: low 0, high 32, batch 16
Node 3 HighMem per-cpu: empty
Node 2 DMA per-cpu: empty
Node 2 Normal per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
cpu 1 hot: low 32, high 96, batch 16
cpu 1 cold: low 0, high 32, batch 16
cpu 2 hot: low 32, high 96, batch 16
cpu 2 cold: low 0, high 32, batch 16
cpu 3 hot: low 32, high 96, batch 16
cpu 3 cold: low 0, high 32, batch 16
cpu 4 hot: low 32, high 96, batch 16
cpu 4 cold: low 0, high 32, batch 16
cpu 5 hot: low 32, high 96, batch 16
cpu 5 cold: low 0, high 32, batch 16
cpu 6 hot: low 32, high 96, batch 16
cpu 6 cold: low 0, high 32, batch 16
cpu 7 hot: low 32, high 96, batch 16
cpu 7 cold: low 0, high 32, batch 16
Node 2 HighMem per-cpu: empty
Node 1 DMA per-cpu: empty
Node 1 Normal per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
cpu 1 hot: low 32, high 96, batch 16
cpu 1 cold: low 0, high 32, batch 16
cpu 2 hot: low 32, high 96, batch 16
cpu 2 cold: low 0, high 32, batch 16
cpu 3 hot: low 32, high 96, batch 16
cpu 3 cold: low 0, high 32, batch 16
cpu 4 hot: low 32, high 96, batch 16
cpu 4 cold: low 0, high 32, batch 16
cpu 5 hot: low 32, high 96, batch 16
cpu 5 cold: low 0, high 32, batch 16
cpu 6 hot: low 32, high 96, batch 16
cpu 6 cold: low 0, high 32, batch 16
cpu 7 hot: low 32, high 96, batch 16
cpu 7 cold: low 0, high 32, batch 16
Node 1 HighMem per-cpu: empty
Node 0 DMA per-cpu:
cpu 0 hot: low 2, high 6, batch 1
cpu 0 cold: low 0, high 2, batch 1
cpu 1 hot: low 2, high 6, batch 1
cpu 1 cold: low 0, high 2, batch 1
cpu 2 hot: low 2, high 6, batch 1
cpu 2 cold: low 0, high 2, batch 1
cpu 3 hot: low 2, high 6, batch 1
cpu 3 cold: low 0, high 2, batch 1
cpu 4 hot: low 2, high 6, batch 1
cpu 4 cold: low 0, high 2, batch 1
cpu 5 hot: low 2, high 6, batch 1
cpu 5 cold: low 0, high 2, batch 1
cpu 6 hot: low 2, high 6, batch 1
cpu 6 cold: low 0, high 2, batch 1
cpu 7 hot: low 2, high 6, batch 1
cpu 7 cold: low 0, high 2, batch 1
Node 0 Normal per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
cpu 1 hot: low 32, high 96, batch 16
cpu 1 cold: low 0, high 32, batch 16
cpu 2 hot: low 32, high 96, batch 16
cpu 2 cold: low 0, high 32, batch 16
cpu 3 hot: low 32, high 96, batch 16
cpu 3 cold: low 0, high 32, batch 16
cpu 4 hot: low 32, high 96, batch 16
cpu 4 cold: low 0, high 32, batch 16
cpu 5 hot: low 32, high 96, batch 16
cpu 5 cold: low 0, high 32, batch 16
cpu 6 hot: low 32, high 96, batch 16
cpu 6 cold: low 0, high 32, batch 16
cpu 7 hot: low 32, high 96, batch 16
cpu 7 cold: low 0, high 32, batch 16
Node 0 HighMem per-cpu: empty

Free pages: 15511408kB (0kB HighMem)
Active:13026 inactive:9053 dirty:0 writeback:0 unstable:0 free:3877852 slab:6602 mapped:3965 pagetables:179
Node 3 DMA free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
protections[]: 0 131000 131000
Node 3 Normal free:4097208kB min:4192kB low:5240kB high:6288kB active:2924kB inactive:4096kB present:4194300kB pages_scanned:0 all_unreclaimable? no
protections[]: 0 0 0
Node 3 HighMem free:0kB min:128kB low:160kB high:192kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
protections[]: 0 0 0
Node 2 DMA free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
protections[]: 0 131000 131000
Node 2 Normal free:4088368kB min:4192kB low:5240kB high:6288kB active:21264kB inactive:11972kB present:4194300kB pages_scanned:0 all_unreclaimable? no
protections[]: 0 0 0
Node 2 HighMem free:0kB min:128kB low:160kB high:192kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
protections[]: 0 0 0
Node 1 DMA free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
protections[]: 0 131000 131000
Node 1 Normal free:4108168kB min:4192kB low:5240kB high:6288kB active:6212kB inactive:4632kB present:4194300kB pages_scanned:0 all_unreclaimable? no
protections[]: 0 0 0
Node 1 HighMem free:0kB min:128kB low:160kB high:192kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
protections[]: 0 0 0
Node 0 DMA free:11880kB min:16kB low:20kB high:24kB active:0kB inactive:0kB present:16384kB pages_scanned:0 all_unreclaimable? no
protections[]: 0 105300 105300
Node 0 Normal free:3205784kB min:3372kB low:4212kB high:5056kB active:21704kB inactive:15512kB present:3375100kB pages_scanned:0 all_unreclaimable? no
protections[]: 0 0 0
Node 0 HighMem free:0kB min:128kB low:160kB high:192kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
protections[]: 0 0 0
Node 3 DMA: empty
Node 3 Normal: 140*4kB 63*8kB 9*16kB 4*32kB 4*64kB 1*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 999*4096kB = 4097208kB
Node 3 HighMem: empty
Node 2 DMA: empty
Node 2 Normal: 164*4kB 40*8kB 6*16kB 4*32kB 2*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 997*4096kB = 4088368kB
Node 2 HighMem: empty
Node 1 DMA: empty
Node 1 Normal: 152*4kB 85*8kB 30*16kB 9*32kB 2*64kB 2*128kB 4*256kB 9*512kB 6*1024kB 1*2048kB 999*4096kB = 4108168kB
Node 1 HighMem: empty
Node 0 DMA: 4*4kB 7*8kB 4*16kB 3*32kB 2*64kB 2*128kB 2*256kB 1*512kB 0*1024kB 1*2048kB 2*4096kB = 11880kB
Node 0 Normal: 0*4kB 3*8kB 0*16kB 12*32kB 8*64kB 4*128kB 3*256kB 1*512kB 0*1024kB 0*2048kB 782*4096kB = 3205784kB
Node 0 HighMem: empty
Swap cache: add 0, delete 0, find 0/0, race 0+0
Free swap: 0kB
3993596 pages of RAM
79899 reserved pages
4704 pages shared
0 pages swap cached

This event sent from IssueTracker by woodard issue 101232

Hi Larry,

LLNL reports that running with numa=off doesn't resolve the issue. However, it appears they're getting much better behavior when they bind each task to a CPU and force memory affinity to the local node with libnuma. The OOM killer goes off instead of the NMI watchdog. So they're confused, then, as to why numa=off didn't help.

This event sent from IssueTracker by kbaxley issue 101232

Created attachment 136206 [details]
Bootlog and crash messages
Hi there,
x-posted from IT#99166
FSC will try numa=off, but would like to know if there is a HOWTO or some docs
on the NUMA API, and if we can elaborate on where the locked memory regions
come from.
They have tried the smp kernel, runlevel 3, and cpuspeed switched off, and it
still caused a panic, which is what the attachment is about.
/A
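No NUMA API HOWTO is attached to this ticket. For orientation only, the kind of per-task binding LLNL described (pin a task to one node's CPUs and allocate from that node) looks roughly like this with libnuma - a sketch, assuming the libnuma development headers are installed and the program is linked with -lnuma; the node argument is an assumption (it would normally come from the job launcher):

```c
/* Sketch of per-task NUMA binding of the sort described in this ticket;
 * not from the ticket itself. Compile with: cc bind.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int node = (argc > 1) ? atoi(argv[1]) : 0;   /* node chosen by the launcher */

    if (numa_available() < 0) {
        fprintf(stderr, "libnuma: NUMA not available on this kernel\n");
        return 1;
    }
    if (numa_run_on_node(node) != 0)   /* run only on this node's CPUs */
        perror("numa_run_on_node");
    numa_set_localalloc();             /* satisfy allocations from the local node */

    /* ... real workload would start here ... */
    return 0;
}
```

The numactl(8) utility can impose the same policy externally on an unmodified binary.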
Larry, one thing that we did do was not remove the nmi watchdog but extend the timeout. The system does get unresponsive for a while, but it does OOM-kill in the end. Mark is working on getting the Sysrq-W's for you. However, I don't completely follow your logic. It seems to me that if multiple CPUs start hanging on zone->lru_lock, then we could easily exceed the number of seconds for the watchdog. You are scanning through 16GB of pages that have been intentionally kept active.

14:55 <grondo> fyi: binding to local memory doesn't work every time, sometimes I still get an NMI timeout
14:55 <grondo> but it does seem to alleviate the hang somewhat

This event sent from IssueTracker by woodard issue 101232

Larry, when you tried to reproduce the problem, were you doing it on a machine with at least 8 CPUs and at least 16GB RAM? Mark was just able to reproduce the problem using his usemem.c with the addition of a mlockall() at the beginning of the code on a stock kernel, 2.6.9-42.ELsmp. I strongly believe that the crux of this problem is the high CPU count and the relatively large amount of mem. Mark and I have been looking at the problem and your comments and we have been discussing it. We believe that you may be looking for a REAL deadlock. We don't believe that is what is happening. We don't believe that a processor is truly getting stuck. We believe that the problem is a slow race between the watchdog timer and the lru_lock. If a processor doesn't hit a code path that does a touch_nmi_watchdog() or nmi_watchdog_tick() before the 5 seconds is up, *poof*. When we first started looking at this problem, we were able to work around it by increasing the length of the watchdog timer to 60 sec, and then things behaved as they should. IMHO this is strong evidence that it isn't a case where a CPU is actually stuck; it is a case where the CPU is not getting around to touching the watchdog timer before it goes off. As an experiment, we are adding a touch_nmi_watchdog() after the lru_lock is dropped. We hypothesize that this will give an indication that processors are moving through this code path but they aren't getting around to the petting that the nmi_watchdog needs to keep from going off. Mark is running that test now.

This event sent from IssueTracker by woodard issue 101232

Larry, Mark added some touch_nmi_watchdog() calls to refill_inactive_list AFTER the lru_lock was dropped. It seems to me that this suggests that my hypothesis - that we are moving through the critical section with the lru_lock but failing to get to a place where the watchdog is updated - is at least partially valid. I hope that this is a useful clue for you and that it doesn't get you chasing a wild goose dreamed up by a couple of 3rd rate kernel hackers. ;-)

16:32 <grondo> Those touch_nmi_watchdog() calls seem to have helped. We get OOM kills instead of NMI timeout
16:32 <grondo> however there was other collateral damage
16:32 <grondo> TCP: time wait bucket table overflow
16:32 <grondo> got a bunch of those on the console
16:33 <grondo> Tomorrow perhaps I'll try removing touch_nmi_watchdog() calls and see at what point NMI watchdog detects timeout again
16:35 <neb> very cool.
16:36 <neb> It definitely does suggest that we are on the right track at least.
16:36 <grondo> system was hung for about 45s at least then came back
16:36 <grondo> yes

This event sent from IssueTracker by woodard issue 101232
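The experimental change itself was never posted here. Its shape, per the description above, is just a watchdog touch at the point where the scan code already releases the lock - a sketch against 2.6.9-era mm/vmscan.c, not the actual diff:

```c
/* Sketch of the experiment described above (NOT the real change).
 * The 2.6.9-era refill_inactive loop already contains this unlock: */
spin_unlock_irq(&zone->lru_lock);
/* The experiment adds a touch immediately after it, so a CPU that is
 * slowly making progress under heavy lru_lock contention resets the
 * 5-second NMI watchdog instead of being declared locked up: */
touch_nmi_watchdog();   /* declared in <linux/nmi.h> */
```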
Larry, OK, I see what you are saying. The sc->nr_to_scan gets set to the min of two numbers, of which one is SWAP_CLUSTER_MAX, which is 32. Got it. However unlikely, do you have another explanation for:
1) increasing the watchdog timer allows the machine to recover
2) touching the watchdog after dropping the lock allows the machine to recover

Maybe I'm being too theoretical here, but if the problem was a failure to drop the lock somewhere, wouldn't that lead to a real deadlock from which the machine wouldn't recover? If that were not the case, then you would have to be doing a double decrement on the lock somewhere, and I thought that was a no-no. In my mind, the fact that we have demonstrated that CPUs move through refill_inactive_list() and do drop the lock seems like a pretty compelling argument that somewhere, with all the CPUs contending for the lock and all the iterations through the critical section until enough mem is freed, the watchdog timer is not getting updated. As unlikely as it may be, I don't have a better explanation that satisfies the two items above. What I'm trying to figure out is where the nearest place that the watchdog timer actually gets poked is. Once Mark is able to capture an alt-sysrq-W, I will be interested in looking through the various backtraces to see how many levels must be peeled away before I get to a function that pokes the watchdog. I suspect that the sysrq-W will show that all CPUs are heavily involved in trying to reclaim memory and several of them happen to be hanging on the lru_lock. I'm sorry, but I think that we may have forgotten to point out that these are diskless, swapless nodes. That may be the key factor that has prevented you from reproducing this problem. We seem to have no problem reproducing it here.

This event sent from IssueTracker by woodard issue 101232

For the moment, setting the overcommit policy to 2 and tuning the overcommit_ratio up higher seems like it might be a satisfactory solution. A key part of this solution was understanding why this is a good potential solution to the problem. If we had proposed setting the overcommit policy earlier, I doubt it would have been accepted. They are going to do additional testing to see if this will work with their workloads. There is some concern that it will not. They ask that the IT ticket stay open until more testing is done. If they run into further problems, then they will use this case to introduce the problem. As a general rule, we may want to document that the overcommit policy needs to be changed from the default on swapless nodes. (We may have dodged a bullet on this one guys -- no scary kernel change. Keep your fingers crossed.)

This event sent from IssueTracker by woodard issue 101232

Hi,

FSC have tried with numa=off, but still experience crashes, albeit with different crash messages. I have enquired about access to hardware for our engineering team to aid in debugging this; still waiting to hear back on that. FSC have tried this with, and without, swap - no difference. I have suggested they try the overcommit policy trick, and also suggested increasing the watchdog timer. I hope to hear back soon about that from them.

Kind Regards,
Anders Karlsson

This event sent from IssueTracker by akarlsso issue 99166

FYI Ben, BZ 200885 is another case in which disabling the nmi watchdog timer via adding "nmi_watchdog=0" to the bootline prevents an 8-way Opteron from crashing due to lock starvation.

Larry Woodman
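For concreteness, the overcommit workaround discussed above amounts to two sysctl writes; a minimal sketch, in which the ratio value of 90 is illustrative rather than taken from the ticket:

```c
/* Sketch of the workaround: strict overcommit accounting plus a higher
 * overcommit_ratio on swapless nodes. Run as root, e.g. at boot. */
#include <stdio.h>

static int write_proc(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        return -1;
    }
    fprintf(f, "%s\n", val);
    return fclose(f);
}

int main(void)
{
    write_proc("/proc/sys/vm/overcommit_memory", "2");  /* policy 2: strict accounting */
    write_proc("/proc/sys/vm/overcommit_ratio", "90");  /* illustrative value, not from the ticket */
    return 0;
}
```

The other mitigation confirmed above, nmi_watchdog=0 on the kernel boot line, simply disables the watchdog rather than tuning the VM.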
It seems like the new test kernel is going into hard lockup, to the point that sysrq's don't seem to be responding, so gathering information is difficult. Power control seems to be working; we just couldn't get any sysrq's. The watchdog is the only source of info. It might be a deadlock with irq disabled, I guess. It looks like it is stuck on a spinlock in refill_inactive_zone():

15:03 <grondo> --- <exception stack> ---
15:03 <grondo> #15 [10208b39938] .text.lock.spinlock at ffffffff802edd69
15:03 <grondo> #16 [10208b39940] refill_inactive_zone at ffffffff80160eb2
15:03 <grondo> #17 [10208b399c0] get_writeback_state at ffffffff8015b730
15:03 <grondo> #18 [10208b399d0] get_dirty_limits at ffffffff8015b755
15:03 <grondo> #19 [10208b39a50] shrink_zone at ffffffff801614cc
15:03 <grondo> #20 [10208b39a80] fist_dprint_internal at ffffffffa00cde2d
15:03 <grondo> #21 [10208b39a90] fist_dprint_internal at ffffffffa00cde2d
15:03 <grondo> #22 [10208b39b00] fist_dprint_internal at ffffffffa00cde2d
15:03 <grondo> #23 [10208b39b10] vsnprintf at ffffffff801e2cf7
15:03 <grondo> #24 [10208b39b60] do_page_cache_readahead at ffffffff8015d41d
15:03 <grondo> #25 [10208b39be0] __up_read at ffffffff801e1f8d
15:03 <grondo> #26 [10208b39c00] shrink_slab at ffffffff80160dbe
15:03 <grondo> #27 [10208b39c40] try_to_free_pages at ffffffff801621b9
15:03 <grondo> #28 [10208b39cf0] __alloc_pages at ffffffff8015a6a7
15:03 <grondo> #29 [10208b39d60] do_no_page at ffffffff80165e4f
15:03 <grondo> #30 [10208b39dc0] handle_mm_fault at ffffffff801663a2
15:03 <grondo> #31 [10208b39e00] follow_page at ffffffff80164a1f
15:03 <grondo> #32 [10208b39e40] get_user_pages at ffffffff80166aa7
15:03 <grondo> #33 [10208b39e90] make_pages_present at ffffffff80166d0e
15:03 <grondo> #34 [10208b39ec0] do_mmap_pgoff at ffffffff80169411
15:04 <grondo> #35 [10208b39f40] sys_mmap at ffffffff80115a75
15:04 <grondo> #36 [10208b39f80] system_call at ffffffff8010f262

This event sent from IssueTracker by woodard issue 101232

Larry, this works much, much better, but it doesn't work perfectly. We've done about 20 tests so far. The first 10 tests were all 8 CPUs. In every case so far this has worked perfectly. The second 10 tests were 6 of the 8 CPUs. In the case when we bound the processes to the first 6 CPUs, then about 1/5 times we hit the NMI watchdog failure mode. We have yet to see a problem when we don't bind the processes to the first 6 CPUs. So something still seems to be odd with the case where 3 of the 4 nodes run out of memory and then all the CPUs seem to jump on the last node to get the memory that they are asking for. We think that the unbound case works because all four nodes are in use. We are going to continue testing it and try to get you additional information, but there still seem to be some bad corner cases.

This event sent from IssueTracker by woodard issue 101232

Created attachment 141093 [details]
Larry Woodman's test patch from IT 101232
JohnRay, what did Goldman start doing that suddenly caused this to start happening? This has been seen before at a few customer sites, but I just want to understand why they just started seeing it now. And yes, the nmi_watchdog=0 workaround will prevent the system from panicking when/if this happens.

Larry Woodman

Larry, they did not roll out Dual Core DL585 systems until after the RHEL 3 DL585 crash issue was resolved in late September. So, in effect, four-way Dual Core RHEL 4 boxes are new to their environment. Most of the problems appear to be in the database group (mostly Sybase). I hope that helps. J

16:27 <grondo> More info on node hangs during OOM kill
16:28 <grondo> looks like the same symptoms as before, all CPUs are stuck in refill_inactive_zone
16:28 <grondo> However, it only seems to happen ~10% of the time
16:28 <grondo> however, for a big job running on 256 nodes, that makes about 26 nodes go down in SLURM after OOM kill event
16:31 <neb> so this is with lwoodman's patch or without.
16:32 <grondo> with the patch. Without, I think we hit this same problem 100% of the time
16:32 <grondo> IIRC

This event sent from IssueTracker by woodard issue 101232

Larry, I know this is one of your top items right now, but, when you get the chance, can you provide me with an update on where you are / what you're looking at with this? The natives are beginning to get a little restless over at LLNL. Thanks.

This event sent from IssueTracker by kbaxley issue 101232

*** Bug 180041 has been marked as a duplicate of this bug. ***

Is there a known reproducer for this? I have a system in the lab ready to go, but I wanted to make sure the usemem.c program in IT 101232 works as advertised to reproduce the problem, and to verify the instructions posted here: https://enterprise.redhat.com/issue-tracker/?module=issues&action=view&tid=101232&gid=975&view_type=lifoall#eid_1059748 Thanks, J

So we have a reproducer that does trigger the problem with the nmi watchdog at the longer timeout. However, we do not have a reproducer which triggers the problem even with Larry's patch.

Hi Dave, it seems like we've got two duplicate Issue Tracker tickets open. I'm going to close this one and work will continue on the other (98198). I've subscribed you to that ticket, so you'll receive e-mail update notifications. Please feel free to re-open this ticket if you don't think it's appropriate to close it at this time. Thanks. -- Bryan

Internal Status set to 'Resolved'
Status set to: Closed by Tech
Resolution set to: 'NotABug'

This event sent from IssueTracker by bjmason issue 92095

I am currently testing some of the upstream changes that were made in this area and that I backported to RHEL4. Basically the system is reactivating too many pages, and when it gets into a serious memory deficit all of the memory is active, and therefore all the CPUs need to deactivate at the same time. This causes all cores to try to acquire the same lock, and the system either takes an NMI watchdog panic or it hangs.

Larry Woodman

Larry, sounds like good progress. In addition to this problem, we are having two other possibly related problems. Understanding how they are related and how they may be related to this issue is not yet clear to me. They could very well be three different manifestations of the same problem, or they could be three distinct problems. I really can't tell at the moment. In any case, I think that it is worth mentioning to you in case they might give you insight into the problem.
1) We have that problem with Lustre where all the memory ends up being active. The Lustre guys are working on that now.
2) We have reports that a similar problem may be happening on management nodes which don't run Lustre and which do have swap. I don't have many details regarding this problem at the moment. That is one of the things that I'm going to be working on today.
3) There is another problem where lots of pages seem to be active and dirty but pdflush doesn't seem to be running to clean these pages.

The common thread seems to be too many active pages.

-ben

That's right, and the reason for that is:
1) anonymous pages start out active and get re-activated when the system reclaims memory, unless that activity is really heavy or unless you increase /proc/sys/vm/swappiness.
2) pagecache hits and all filesystem writes activate the associated page.

BTW Ben, do you have any vmcores from the NMI watchdog panic???

Larry

Sorry Larry, no I don't. Since this was happening on production machines, we quickly worked around the problem by increasing the watchdog timeout, so we don't hit the problem the same way. We also have your patch in our kernel.
I have a test kernel to try out if anyone wants to run it. It's in
http://people.redhat.com/~lwoodman/GS/
I added a new /proc/sys/vm/pagecache tunable parameter that needs to be
lowered to 10 so the system favors reclaiming unmapped pagecache memory
over mapped memory. The default is 100, so if they don't set this new
parameter the kernel won't do anything new.
Larry
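Applying the setting is a one-line write; a sketch for completeness (note that /proc/sys/vm/pagecache exists only on Larry's test kernel above, so the open is expected to fail elsewhere):

```c
/* Read the test kernel's pagecache tunable, then lower it to 10 as
 * instructed above. The file exists only on the test kernel. */
#include <stdio.h>

int main(void)
{
    char cur[16] = "";
    FILE *f = fopen("/proc/sys/vm/pagecache", "r");
    if (!f) {
        perror("/proc/sys/vm/pagecache");   /* probably not the test kernel */
        return 1;
    }
    if (fgets(cur, sizeof(cur), f))
        printf("pagecache was: %s", cur);   /* default is 100 */
    fclose(f);

    f = fopen("/proc/sys/vm/pagecache", "w");
    if (!f) {
        perror("/proc/sys/vm/pagecache");
        return 1;
    }
    fprintf(f, "10\n");                     /* favor unmapped pagecache reclaim */
    return fclose(f);
}
```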
Committed in stream U6 build 55.1.

A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/

Adding to RHEL4.6 release notes (under "Kernel-Related Updates"):

<quote>
/proc/sys/vm/drop_caches added to clear pagecache and slabcache on demand
</quote>

Please advise if any revisions are in order. Thanks!
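The release-note item quoted above refers to the standard drop_caches interface; for illustration (the values 1, 2, and 3 select pagecache, slab, or both):

```c
/* Drop clean pagecache and/or slab objects on demand via the
 * /proc/sys/vm/drop_caches interface mentioned in the release note. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/vm/drop_caches", "w");
    if (!f) {
        perror("/proc/sys/vm/drop_caches");
        return 1;
    }
    fprintf(f, "3\n");   /* 1 = pagecache, 2 = slab, 3 = both */
    return fclose(f);
}
```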
A fix for this issue should have been included in the packages contained in the RHEL4.6 Beta released on RHN (also available at partners.redhat.com). Requested action: please verify that your issue is fixed to ensure that it is included in this update release. After you (Red Hat Partner) have verified that this issue has been addressed, please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified).
If this issue is not fixed, please add a comment describing the most recent symptoms of the problem you are having and change the status of the bug to FAILS_QA. If you cannot access bugzilla, please reply with a message to Issue Tracker and I will change the status for you. If you need assistance accessing ftp://partners.redhat.com, please contact your Partner Manager.

[The same verification request was posted again for RHEL4.6-Snapshot1, Snapshot2, Snapshot3, Snapshot4, and Snapshot5 as each became available on partners.redhat.com.]

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0791.html